Preface


Econometrics can be a fun course for both teacher and student. The real world
of economics, business, and government is a complicated and messy place, full
of competing ideas and questions that demand answers. Does healthcare spending 
actually improve health outcomes? Can you make money in the stock market by 
buying when prices are historically low, relative to earnings, or should you just sit 
tight, as the random walk theory of stock prices suggests? Does heavy intake of cof- 
fee lower the risk of disease or death? Econometrics helps us sort out sound ideas 
from crazy ones and find quantitative answers to important quantitative questions. 
Econometrics opens a window on our complicated world that lets us see the relation- 
ships on which people, businesses, and governments base their decisions. 

Introduction to Econometrics is designed for a first course in undergraduate 
econometrics. It is our experience that to make econometrics relevant in an introduc- 
tory course, interesting applications must motivate the theory and the theory must 
match the applications. This simple principle represents a significant departure from 
the older generation of econometrics books, in which theoretical models and assump- 
tions do not match the applications. It is no wonder that some students question the 
relevance of econometrics after they spend much of their time learning assumptions 
that they subsequently realize are unrealistic so that they must then learn “solutions” 
to “problems” that arise when the applications do not match the assumptions. We 
believe that it is far better to motivate the need for tools with a concrete application 
and then to provide a few simple assumptions that match the application. Because 
the methods are immediately relevant to the applications, this approach can make 
econometrics come alive. 

To improve student results, we recommend pairing the text content with MyLab 
Economics, which is the teaching and learning platform that empowers you to reach 
every student. By combining trusted author content with digital tools and a flexible 
platform, MyLab personalizes the learning experience and will help your students 
learn and retain key course concepts while developing skills that future employers 
are seeking in their candidates. MyLab Economics helps you teach your course, your 
way. Learn more at www.pearson.com/mylab/economics. 


New To This Edition 


• New chapter on “Big Data” and machine learning
• Forecasting in time series data with large data sets
• Dynamic factor models
• Parallel treatment of prediction and causal inference using regression
• Coverage of realized volatility as well as autoregressive conditional heteroskedasticity
• Updated discussion of weak instruments


Very large data sets are increasingly being used in economics and related fields. 
Applications include predicting consumer choices, measuring the quality of hospitals 
or schools, analyzing nonstandard data such as text data, and macroeconomic fore- 
casting with many variables. The three main additions in this edition incorporate the 
fundamentals of this growing and exciting area of application. 

First, we have a new chapter (Chapter 14) that focuses on big data and machine 
learning methods. Within economics, many of the applications to date have focused 
on the so-called many-predictor problem, where the number of predictors is large
relative to the sample size, perhaps even exceeding the sample size. With many
predictors, ordinary least squares (OLS) provides poor predictions, and other methods, such
as the LASSO, can have much lower out-of-sample prediction errors. This chapter 
goes over the concepts of out-of-sample prediction, why OLS performs poorly, and 
how shrinkage can improve upon OLS. The chapter introduces shrinkage methods 
and prediction using principal components, shows how to choose tuning parameters 
by cross-validation, and explains how these methods can be used to analyze nonstan- 
dard data such as text data. As usual, this chapter has a running empirical example, 
in this case, prediction of school-level test scores given school-level characteristics, 
for California elementary schools. 
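
As a rough illustration of the many-predictor problem, the following Python sketch compares OLS with ridge regression on simulated data in which the number of predictors is close to the sample size. Everything in it is an assumption made for illustration: the simulated data, the sample sizes, the penalty grid, and the helper names `ridge_fit` and `cv_mspe` are hypothetical. Ridge is used only because it is a shrinkage estimator with a closed form, not because it is the method the chapter emphasizes, and the data are not the California school data.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients; lam = 0 reduces to OLS (no intercept, mean-zero data assumed)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def cv_mspe(X, y, lam, n_folds=5):
    """Average held-out mean squared prediction error across the folds."""
    idx = np.arange(len(y))
    errors = []
    for hold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, hold)
        b = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[hold] - X[hold] @ b) ** 2))
    return float(np.mean(errors))

# Simulated data (assumed design): 100 weak predictors, 120 training observations.
rng = np.random.default_rng(0)
n, k = 120, 100
beta = np.full(k, 0.1)
X_train = rng.standard_normal((n, k))
y_train = X_train @ beta + rng.standard_normal(n)
X_test = rng.standard_normal((1000, k))
y_test = X_test @ beta + rng.standard_normal(1000)

# Choose the penalty (tuning parameter) by 5-fold cross-validation.
lambdas = [1.0, 10.0, 100.0, 1000.0]
best_lam = min(lambdas, key=lambda lam: cv_mspe(X_train, y_train, lam))

# Compare out-of-sample prediction errors on fresh draws from the same population.
b_ols = ridge_fit(X_train, y_train, 0.0)
b_ridge = ridge_fit(X_train, y_train, best_lam)
ols_mspe = float(np.mean((y_test - X_test @ b_ols) ** 2))
ridge_mspe = float(np.mean((y_test - X_test @ b_ridge) ** 2))
print(f"chosen penalty: {best_lam}, OLS MSPE: {ols_mspe:.2f}, ridge MSPE: {ridge_mspe:.2f}")
```

In this design OLS fits 100 coefficients from 120 observations, so its estimates are very noisy. Shrinking them toward zero trades a little bias for a large reduction in variance, which is the sense in which shrinkage can improve on OLS out of sample.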

Second, in Chapter 17 (newly renumbered), we extend the many-predictor focus 
of Chapter 14 to time series data. Specifically, we show how the dynamic factor model 
can handle a very large number of time series, and show how to implement the 
dynamic factor model using principal components analysis. We illustrate the dynamic 
factor model and its use for forecasting with a 131-variable dataset of U.S. quarterly 
macroeconomic time series. 
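
The principal components step can be sketched in a few lines of Python. This is a minimal simulation under assumed values (one common factor, 50 series, 200 periods) and does not use the 131-variable data set. The variable names (`f_hat`, `share`, `forecast`) and the simple one-step-ahead forecasting regression are illustrative assumptions, not the text's specification.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 200, 50

# Assumed data-generating process: one persistent common factor drives all series.
f = np.zeros(T)
for t in range(1, T):
    f[t] = 0.7 * f[t - 1] + rng.standard_normal()
loadings = rng.normal(1.0, 0.5, size=N)
X = np.outer(f, loadings) + rng.standard_normal((T, N))

# Standardize each series, then take the first principal component via the SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
f_hat = U[:, 0] * s[0]                  # estimated factor (sign and scale are arbitrary)
share = s[0] ** 2 / np.sum(s ** 2)      # share of total variance explained by the first PC

# The estimated factor should track the true factor closely (up to sign).
corr = abs(np.corrcoef(f, f_hat)[0, 1])

# One-step-ahead forecast of the first series from the lagged estimated factor.
A = np.column_stack([np.ones(T - 1), f_hat[:-1]])
coef, *_ = np.linalg.lstsq(A, X[1:, 0], rcond=None)
forecast = coef[0] + coef[1] * f_hat[-1]
print(f"variance share: {share:.2f}, |corr(f, f_hat)|: {corr:.3f}, forecast: {forecast:.2f}")
```

The point of the sketch is that a single estimated factor summarizes the co-movement of many series, so the forecasting regression has one regressor instead of fifty.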

Third, we now lay out these two uses of regression—causal inference and 
prediction—up front, when regression is first introduced in Chapter 4. Regression 
is a statistical tool that can be used to make causal inferences or to make predictions;
the two applications place different demands on how the data are collected.
When the data are from a randomized controlled experiment, OLS estimates the 
causal effect. In observational data, if we are interested in estimating the causal 
effect, then the econometrician needs to use control variables and/or instruments 
to produce as-if randomization of the variable of interest. In contrast, for predic- 
tion, one is not interested in the causal effect so one does not need as-if random 
variation; however, the estimation (“training”) data set must be drawn from the 
same population as the observations for which one wishes to make the prediction. 

This edition has several smaller changes. For example, we now introduce realized
volatility as a complement to the GARCH model when analyzing time series data 
with volatility clustering. In addition, we now extend the discussion (in a new general 
interest box) of the historical origins of instrumental variables regression in Chapter 
12. This treatment now includes a first-ever reproduction of the original derivation 
of the IV estimator, which was in a letter from Philip Wright to his son Sewall in the 
spring of 1926, and a discussion of the first IV regression, an estimate of the elasticity 
of supply of flaxseed. 


Solving Teaching and Learning Challenges 


Introduction to Econometrics differs from other texts in three main ways. First, we 
integrate real-world questions and data into the development of the theory, and we 
take seriously the substantive findings of the resulting empirical analysis. Second, 
our choice of topics reflects modern theory and practice. Third, we provide theory 
and assumptions that match the applications. Our aim is to teach students to become 
sophisticated consumers of econometrics and to do so at a level of mathematics 
appropriate for an introductory course. 


Real-World Questions and Data 


We organize each methodological topic around an important real-world question 
that demands a specific numerical answer. For example, we teach single-variable 
regression, multiple regression, and functional form analysis in the context of 
estimating the effect of school inputs on school outputs. (Do smaller elementary 
school class sizes produce higher test scores?) We teach panel data methods in 
the context of analyzing the effect of drunk driving laws on traffic fatalities. We 
use possible racial discrimination in the market for home loans as the empirical 
application for teaching regression with a binary dependent variable (logit and 
probit). We teach instrumental variable estimation in the context of estimating 
the demand elasticity for cigarettes. Although these examples involve economic 
reasoning, all can be understood with only a single introductory course in econom- 
ics, and many can be understood without any previous economics coursework. 
Thus the instructor can focus on teaching econometrics, not microeconomics or 
macroeconomics. 

We treat all our empirical applications seriously and in a way that shows stu- 
dents how they can learn from data but at the same time be self-critical and aware 
of the limitations of empirical analyses. Through each application, we teach stu- 
dents to explore alternative specifications and thereby to assess whether their sub- 
stantive findings are robust. The questions asked in the empirical applications are 
important, and we provide serious and, we think, credible answers. We encourage 
students and instructors to disagree, however, and invite them to reanalyze the 
data, which are provided on the text’s Companion Website (www.pearsonglobaleditions.com)
and in MyLab Economics.

Throughout the text, we have focused on helping students understand, retain, 
and apply the essential ideas. Chapter introductions provide real-world grounding 
and motivation, as well as brief road maps highlighting the sequence of the discus- 
sion. Key terms are boldfaced and defined in context throughout each chapter, and 
Key Concept boxes at regular intervals recap the central ideas. General interest 
boxes provide interesting excursions into related topics and highlight real-world 
studies that use the methods or concepts being discussed in the text. A Summary 
concluding each chapter serves as a helpful framework for reviewing the main 
points of coverage. 

Available for student practice or instructor assignment in MyLab 
Economics are Review the Concepts questions, Exercises, and Empirical 
Exercises from the text. These questions and exercises are auto-graded, giv- 
ing students practical hands-on experience with solving problems using the 
data sets used in the text. 


• 100 percent of Review the Concepts questions are available in MyLab.

• Select Exercises and Empirical Exercises are available in MyLab. Many of the
Empirical Exercises are algorithmic and based on the data sets used in the text.
These exercises require students to use Excel or an econometrics software package
to analyze the data and derive results.

• New to the 4th edition are concept exercises that focus on core concepts and
economic interpretations. Many are algorithmic and include the Help Me Solve
This learning aid.


Contemporary Choice of Topics 


The topics we cover reflect the best of contemporary applied econometrics. One can 
only do so much in an introductory course, so we focus on procedures and tests that 
are commonly (or increasingly) used in practice. For example: 


• Instrumental variables regression. We present instrumental variables regression
as a general method for handling correlation between the error term
and a regressor, which can arise for many reasons, including omitted variables
and simultaneous causality. The two assumptions for a valid instrument,
exogeneity and relevance, are given equal billing. We follow that presentation
with an extended discussion of where instruments come from and with tests of 
overidentifying restrictions and diagnostics for weak instruments, and we explain 
what to do if these diagnostics suggest problems. 


• Program evaluation. Many modern econometric studies analyze either randomized
controlled experiments or quasi-experiments, also known as natural
experiments. We address these topics, often collectively referred to as program
evaluation, in Chapter 13. We present this research strategy as an alternative
approach to the problems of omitted variables, simultaneous causality, and 
selection, and we assess both the strengths and the weaknesses of studies using 
experimental or quasi-experimental data. 


• Prediction with “big data.” Chapter 14 takes up the opportunities and
challenges posed by large cross-sectional data sets. An increasingly common 
application in econometrics is making predictions when the number of pre- 
dictors is very large. This chapter focuses on methods designed to use many 
predictors in a way that produces accurate and precise out-of-sample predic- 
tions. The chapter covers some of the building blocks of machine learning, 
and the methods can substantially improve upon OLS when the number of 
predictors is large. In addition, these methods extend to nonstandard data, 
such as text data. 


• Forecasting. The chapter on forecasting (Chapter 15) considers univariate
(autoregressive) and multivariate forecasts using time series regression, not 
large simultaneous equation structural models. We focus on simple and reliable 
tools, such as autoregressions and model selection via an information criterion, 
that work well in practice. This chapter also features a practically oriented treat- 
ment of structural breaks (at known and unknown dates) and pseudo out-of- 
sample forecasting, all in the context of developing stable and reliable time 
series forecasting models. 


• Time series regression. The chapter on causal inference using time series
data (Chapter 16) pays careful attention to when different estimation 
methods, including generalized least squares, will or will not lead to valid 
causal inferences and when it is advisable to estimate dynamic regressions 
using OLS with heteroskedasticity- and autocorrelation-consistent stan- 
dard errors. 


Theory That Matches Applications 


Although econometric tools are best motivated by empirical applications, students 
need to learn enough econometric theory to understand the strengths and limita- 
tions of those tools. We provide a modern treatment in which the fit between theory 
and applications is as tight as possible, while keeping the mathematics at a level that 
requires only algebra. 

Modern empirical applications share some common characteristics: The data 
sets typically have many observations (hundreds or more); regressors are not fixed 
over repeated samples but rather are collected by random sampling (or some other 
mechanism that makes them random); the data are not normally distributed; and 
there is no a priori reason to think that the errors are homoskedastic (although often 
there are reasons to think that they are heteroskedastic). 

These observations lead to important differences between the theoretical development
in this text and other texts:


• Large-sample approach. Because data sets are large, from the outset we use
large-sample normal approximations to sampling distributions for hypothesis 
testing and confidence intervals. In our experience, it takes less time to teach the 
rudiments of large-sample approximations than to teach the Student t and exact
F distributions, degrees-of-freedom corrections, and so forth. This large-sample 
approach also saves students the frustration of discovering that, because of 
nonnormal errors, the exact distribution theory they just mastered is irrelevant. 
Once taught in the context of the sample mean, the large-sample approach to 
hypothesis testing and confidence intervals carries directly through multiple 
regression analysis, logit and probit, instrumental variables estimation, and time 
series methods. 


• Random sampling. Because regressors are rarely fixed in econometric applications,
from the outset we treat data on all variables (dependent and independent)
as the result of random sampling. This assumption matches our initial
applications to cross-sectional data, it extends readily to panel and time series 
data, and because of our large-sample approach, it poses no additional concep- 
tual or mathematical difficulties. 


• Heteroskedasticity. Applied econometricians routinely use heteroskedasticity-robust
standard errors to eliminate worries about whether heteroskedasticity is
present or not. In this book, we move beyond treating heteroskedasticity as an 
exception or a “problem” to be “solved”; instead, we allow for heteroskedastic- 
ity from the outset and simply use heteroskedasticity-robust standard errors. We 
present homoskedasticity as a special case that provides a theoretical motivation 
for OLS. 
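
To make the large-sample and heteroskedasticity points concrete, here is a small Python sketch that computes homoskedasticity-only and heteroskedasticity-robust standard errors for a simulated regression, then forms a 95% confidence interval from the normal approximation. The simulated design, the variable names, and the n/(n − k) scaling of the robust variance (a common convention) are assumptions made for illustration, not formulas quoted from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Assumed design: the error variance rises with x**2, so errors are heteroskedastic.
x = rng.standard_normal(n)
u = x * rng.standard_normal(n)
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])     # intercept and a single regressor
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Homoskedasticity-only variance: s^2 (X'X)^{-1}.
s2 = resid @ resid / (n - k)
se_homo = np.sqrt(np.diag(s2 * XtX_inv))

# Heteroskedasticity-robust "sandwich" variance:
# (X'X)^{-1} [sum_i u_i^2 x_i x_i'] (X'X)^{-1}, scaled by n / (n - k).
meat = (X * resid[:, None] ** 2).T @ X
V_robust = XtX_inv @ meat @ XtX_inv * n / (n - k)
se_robust = np.sqrt(np.diag(V_robust))

# Large-sample 95% confidence interval for the slope, using the normal critical value 1.96.
ci_low = beta_hat[1] - 1.96 * se_robust[1]
ci_high = beta_hat[1] + 1.96 * se_robust[1]
print(f"slope: {beta_hat[1]:.3f}, homoskedasticity-only SE: {se_homo[1]:.3f}, "
      f"robust SE: {se_robust[1]:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

In this design the homoskedasticity-only formula understates the sampling uncertainty of the slope, so inference based on it would be too optimistic. The robust formula remains valid whether or not the errors are heteroskedastic, which is why it can be used from the outset.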


Skilled Producers, Sophisticated Consumers 


We hope that students using this book will become sophisticated consumers of 
empirical analysis. To do so, they must learn not only how to use the tools of regres- 
sion analysis but also how to assess the validity of empirical analyses presented to 
them. 

Our approach to teaching how to assess an empirical study is threefold. First, 
immediately after introducing the main tools of regression analysis, we devote 
Chapter 9 to the threats to internal and external validity of an empirical study. This 
chapter discusses data problems and issues of generalizing findings to other settings. 
It also examines the main threats to regression analysis, including omitted variables, 
functional form misspecification, errors-in-variables, selection, and simultaneity — 
and ways to recognize these threats in practice. 

Second, we apply these methods for assessing empirical studies to the empirical
analysis of the ongoing examples in the book. We do so by considering alternative 
specifications and by systematically addressing the various threats to validity of the 
analyses presented in the book. 

Third, to become sophisticated consumers, students need firsthand experience 
as producers. Active learning beats passive learning, and econometrics is an ideal 
course for active learning. For this reason, the MyLab Economics and text web- 
site feature data sets, software, and suggestions for empirical exercises of different 
scopes. 


Approach to Mathematics and Level of Rigor 


Our aim is for students to develop a sophisticated understanding of the tools of 
modern regression analysis, whether the course is taught at a “high” or a “low” level 
of mathematics. Parts I through IV of the text (which cover the substantive material) 
are written for students with only precalculus mathematics. Parts I through IV have 
fewer equations and more applications than many introductory econometrics books 
and far fewer equations than books aimed at mathematical sections of undergradu- 
ate courses. But more equations do not imply a more sophisticated treatment. In our 
experience, a more mathematical treatment does not lead to a deeper understanding 
for most students. 

That said, different students learn differently, and for mathematically well- 
prepared students, learning can be enhanced by a more explicit mathematical 
treatment. The appendices in Parts I-IV therefore provide key calculations that 
are too involved to be included in the text. In addition, Part V contains an intro- 
duction to econometric theory that is appropriate for students with a stronger 
mathematical background. When the mathematical chapters in Part V are used 
in conjunction with the material in Parts I through IV (including appendices), 
this book is suitable for advanced undergraduate or master’s level econometrics 
courses. 


Developing Career Skills 


For students to succeed in a rapidly changing job market, they should be aware 
of their career options and how to go about developing a variety of skills. Data 
analysis is an increasingly marketable skill. This text prepares the students for 
a range of data analytic applications, including causal inference and prediction. 
It also introduces the students to the core concepts of prediction using large 
data sets. 


Table of Contents Overview


There are five parts to Introduction to Econometrics. This text assumes that the stu- 
dent has had a course in probability and statistics, although we review that material 
in Part I. We cover the core material of regression analysis in Part II. Parts III, IV, and 
V present additional topics that build on the core treatment in Part II. 


Part I


Chapter 1 introduces econometrics and stresses the importance of providing quanti- 
tative answers to quantitative questions. It discusses the concept of causality in sta- 
tistical studies and surveys the different types of data encountered in econometrics. 
Material from probability and statistics is reviewed in Chapters 2 and 3, respectively; 
whether these chapters are taught in a given course or are simply provided as a refer- 
ence depends on the background of the students. 


Part II


Chapter 4 introduces regression with a single regressor and ordinary least squares 
(OLS) estimation, and Chapter 5 discusses hypothesis tests and confidence intervals 
in the regression model with a single regressor. In Chapter 6, students learn how they 
can address omitted variable bias using multiple regression, thereby estimating the 
effect of one independent variable while holding other independent variables con- 
stant. Chapter 7 covers hypothesis tests, including F-tests, and confidence intervals in 
multiple regression. In Chapter 8, the linear regression model is extended to models 
with nonlinear population regression functions, with a focus on regression functions 
that are linear in the parameters (so that the parameters can be estimated by OLS). In 
Chapter 9, students step back and learn how to identify the strengths and limitations 
of regression studies, seeing in the process how to apply the concepts of internal and 
external validity. 


Part III 


Part III presents extensions of regression methods. In Chapter 10, students learn 
how to use panel data to control for unobserved variables that are constant over 
time. Chapter 11 covers regression with a binary dependent variable. Chapter 12 
shows how instrumental variables regression can be used to address a variety of 
problems that produce correlation between the error term and the regressor, and 
examines how one might find and evaluate valid instruments. Chapter 13 introduces 
students to the analysis of data from experiments and quasi-, or natural, experiments, 
topics often referred to as “program evaluation.” Chapter 14 turns to econometric 
issues that arise with large data sets, and focuses on prediction when there are very 
many predictors. 


Part IV 


Part IV takes up regression with time series data. Chapter 15 focuses on forecasting 
and introduces various modern tools for analyzing time series regressions, such as 
tests for stability. Chapter 16 discusses the use of time series data to estimate causal 
relations. Chapter 17 presents some more advanced tools for time series analysis, 
including models of volatility clustering and dynamic factor models. 


Part V 


Part V is an introduction to econometric theory. This part is more than an appendix 
that fills in mathematical details omitted from the text. Rather, it is a self-contained 
treatment of the econometric theory of estimation and inference in the linear regres- 
sion model. Chapter 18 develops the theory of regression analysis for a single regres- 
sor; the exposition does not use matrix algebra, although it does demand a higher 
level of mathematical sophistication than the rest of the text. Chapter 19 presents the 
multiple regression model, instrumental variables regression, generalized method of 
moments estimation of the linear model, and principal components analysis, all in 
matrix form. 


Prerequisites Within the Book 


Because different instructors like to emphasize different material, we wrote this 
book with diverse teaching preferences in mind. To the maximum extent possible, 
the chapters in Parts III, IV, and V are “stand-alone” in the sense that they do not 
require first teaching all the preceding chapters. The specific prerequisites for 
each chapter are described in Table I. Although we have found that the sequence 
of topics adopted in the text works well in our own courses, the chapters are writ- 
ten in a way that allows instructors to present topics in a different order if they 
so desire. 


Sample Courses


This book accommodates several different course structures. 


TABLE I  Guide to Prerequisites for Special-Topic Chapters in Parts III, IV, and V

Prerequisite parts or chapters, in column order:
  Part I:   1-3
  Part II:  4-7, 9; 8
  Part III: 10.1, 10.2; 12.1, 12.2
  Part IV:  15.1-15.4; 15.5-15.8; 16
  Part V:   18

Chapter       Prerequisites marked
10            X^a  X^a  X
11            X^a  X^a  X
12.1, 12.2    X^a  X^a  X
12.3-12.6     X^a  X^a  X
13            X^a  X^a  X
14            X^a  X^a  X
15            X^a  X^a  b
16            X^a  X^a  b
17            X^a  X^a  b  X  X  X
18
19            X  X  X  X  X

This table shows the minimum prerequisites needed to cover the material in a given chapter. For example, estimation of
dynamic causal effects with time series data (Chapter 16) first requires Part I (as needed, depending on student preparation,
and except as noted in footnote a), Part II (except for Chapter 8; see footnote b), and Sections 15.1 through 15.4.
a. Chapters 10 through 17 use exclusively large-sample approximations to sampling distributions, so the optional Sections 3.6
(the Student t distribution for testing means) and 5.6 (the Student t distribution for testing regression coefficients) can be
skipped.
b. Chapters 15 through 17 (the time series chapters) can be taught without first teaching Chapter 8 (nonlinear regression
functions) if the instructor pauses to explain the use of logarithmic transformations to approximate percentage changes.


Standard Introductory Econometrics 


This course introduces econometrics (Chapter 1) and reviews probability and sta- 
tistics as needed (Chapters 2 and 3). It then moves on to regression with a single 
regressor, multiple regression, the basics of functional form analysis, and the evalua- 
tion of regression studies (all Part II). The course proceeds to cover regression with 
panel data (Chapter 10), regression with a limited dependent variable (Chapter 11), 
and instrumental variables regression (Chapter 12), as time permits. The course then 
turns to experiments and quasi-experiments in Chapter 13, topics that provide an
opportunity to return to the questions of estimating causal effects raised at the begin- 
ning of the semester and to recapitulate core regression methods. If there is time, 
the students can be introduced to big data and machine learning methods at the end 
(Chapter 14). Prerequisites: Algebra II and introductory statistics. 


Introductory Econometrics with Time Series 
and Forecasting Applications 


Like a standard introductory course, this course covers all of Part I (as needed) 
and Part II. Optionally, the course next provides a brief introduction to panel data 
(Sections 10.1 and 10.2) and takes up instrumental variables regression (Chapter 
12, or just Sections 12.1 and 12.2). The course then proceeds to Chapter 14 (prediction
in large cross-sectional data sets). It then turns to Part IV, covering forecasting
(Chapter 15) and estimation of dynamic causal effects (Chapter 16). If time permits,
the course can include some advanced topics in time series analysis such as volatility
clustering (Section 17.5) and forecasting with many predictors (Section 17.6).
Prerequisites: Algebra II and introductory statistics. 


Applied Time Series Analysis and Forecasting 


This book also can be used for a short course on applied time series and forecasting, 
for which a course on regression analysis is a prerequisite. Some time is spent review- 
ing the tools of basic regression analysis in Part II, depending on student preparation. 
The course then moves directly to time series forecasting (Chapter 15), estimation 
of dynamic causal effects (Chapter 16), and advanced topics in time series analysis 
(Chapter 17), including vector autoregressions. If there is time, the course can cover 
prediction using large data sets (Chapter 14 and Section 17.6). An important component
of this course is hands-on forecasting exercises, available as the end-of-chapter
Empirical Exercises for Chapters 15 and 17. Prerequisites: Algebra II and basic intro- 
ductory econometrics or the equivalent. 


Introduction to Econometric Theory 


This book is also suitable for an advanced undergraduate course in which the stu- 
dents have a strong mathematical preparation or for a master’s level course in 
econometrics. The course briefly reviews the theory of statistics and probability as 
necessary (Part I). The course introduces regression analysis using the nonmath- 
ematical, applications-based treatment of Part II. This introduction is followed by 
the theoretical development in Chapters 18 and 19 (through Section 19.5). The 
course then takes up regression with a limited dependent variable (Chapter 11) 
and maximum likelihood estimation (Appendix 11.2). Next, the course optionally 
turns to instrumental variables regression and generalized method of moments 
(Chapter 12 and Section 19.7), time series methods (Chapter 15), the estimation of 
causal effects using time series data and generalized least squares (Chapter 16 and
Section 19.6), and/or to machine learning methods (Chapter 14 and Appendix 19.7).
Prerequisites: Calculus and introductory statistics. Chapter 19 assumes previous
exposure to matrix algebra.


Instructor Teaching Resources 


This program comes with the following teaching resources: 


Supplements available to instructors at www.pearsonglobaleditions.com, with the
features of each supplement:

Solutions Manual
  Solutions to the end-of-chapter content.

Test Bank (authored by Manfred Keil, Claremont McKenna College)
  1,000 multiple-choice questions, essays and longer questions, and mathematical
  and graphical problems with these annotations:
  • Type (multiple-choice, essay, graphical)

Computerized TestGen
  TestGen allows instructors to:
  • Customize, save, and generate classroom tests
  • Edit, add, or delete questions from the Test Item Files
  • Analyze test results
  • Organize a database of tests and student results

PowerPoints
  Slides include all the graphs, tables, and equations in the text.
  PowerPoints meet accessibility standards for students with disabilities.
  Features include, but are not limited to:
  • Keyboard and screen reader access
  • Alternative text for images
  • High color contrast between background and foreground colors

Companion Website
  The Companion Website provides a wide range of additional resources for
  students and faculty. These resources include more and more in-depth
  empirical exercises, data sets for the empirical exercises, replication files
  for empirical results reported in the text, and EViews tutorials.


Economic Questions and Data


Ask a half dozen econometricians what econometrics is, and you could get a half
dozen different answers. One might tell you that econometrics is the science of
testing economic theories. A second might tell you that econometrics is the set of
tools used for forecasting future values of economic variables, such as a firm’s sales, the
overall growth of the economy, or stock prices. Another might say that econometrics is
the process of fitting mathematical economic models to real-world data. A fourth
might tell you that it is the science and art of using historical data to make numerical,
or quantitative, policy recommendations in government and business.

In fact, all these answers are right. At a broad level, econometrics is the science
and art of using economic theory and statistical techniques to analyze economic data.
Econometric methods are used in many branches of economics, including finance,
labor economics, macroeconomics, microeconomics, marketing, and economic policy.
Econometric methods are also commonly used in other social sciences, including
political science and sociology.

This text introduces you to the core set of methods used by econometricians. We
will use these methods to answer a variety of specific, quantitative questions from the
worlds of business and government policy. This chapter poses four of those questions
and discusses, in general terms, the econometric approach to answering them. The
chapter concludes with a survey of the main types of data available to econometri-
cians for answering these and other quantitative economic questions.


1.1 Economic Questions We Examine


Many decisions in economics, business, and government hinge on understanding rela- 
tionships among variables in the world around us. These decisions require quantita- 
tive answers to quantitative questions. 

This text examines several quantitative questions taken from current issues in 
economics. Four of these questions concern education policy, racial bias in mortgage
lending, healthcare spending, and macroeconomic forecasting.


Question #1: Does Reducing Class Size Improve
Elementary School Education?

Proposals for reform of the U.S. public education system generate heated debate. 
Many of the proposals concern the youngest students, those in elementary schools. 
Elementary school education has various objectives, such as developing social skills,
but for many parents and educators, the most important objective is basic academic
learning: reading, writing, and basic mathematics. One prominent proposal for 
improving basic learning is to reduce class sizes at elementary schools. With fewer 
students in the classroom, the argument goes, each student gets more of the teacher’s 
attention, there are fewer class disruptions, learning is enhanced, and grades 
improve. 

But what, precisely, is the effect on elementary school education of reducing class 
size? Reducing class size costs money: It requires hiring more teachers and, if the 
school is already at capacity, building more classrooms. A decision maker contem- 
plating hiring more teachers must weigh these costs against the benefits. To weigh 
costs and benefits, however, the decision maker must have a precise quantitative 
understanding of the likely benefits. Is the beneficial effect on basic learning of 
smaller classes large or small? Is it possible that smaller class size actually has no 
effect on basic learning? 

Although common sense and everyday experience may suggest that more learn- 
ing occurs when there are fewer students, common sense cannot provide a quantita- 
tive answer to the question of what exactly is the effect on basic learning of reducing 
class size. To provide such an answer, we must examine empirical evidence—that is, 
evidence based on data—relating class size to basic learning in elementary schools. 

In this text, we examine the relationship between class size and basic learning, 
using data gathered from 420 California school districts in 1999. In the California 
data, students in districts with small class sizes tend to perform better on standardized 
tests than students in districts with larger classes. While this fact is consistent with the 
idea that smaller classes produce better test scores, it might simply reflect many other 
advantages that students in districts with small classes have over their counterparts 
in districts with large classes. For example, districts with small class sizes tend to have 
wealthier residents than districts with large classes, so students in small-class districts 
could have more opportunities for learning outside the classroom. It could be these 
extra learning opportunities that lead to higher test scores, not smaller class sizes. 
In Part II, we use multiple regression analysis to isolate the effect of changes in class 
size from changes in other factors, such as the economic background of the 
students. 


Question #2: Is There Racial Discrimination 
in the Market for Home Loans? 


Most people buy their homes with the help of a mortgage, a large loan secured by the 
value of the home. By law, U.S. lending institutions cannot take race into account when 
deciding to grant or deny a request for a mortgage: Applicants who are identical in all 
ways except their race should be equally likely to have their mortgage applications 
approved. In theory, then, there should be no racial bias in mortgage lending. 

In contrast to this theoretical conclusion, researchers at the Federal Reserve Bank 
of Boston found (using data from the early 1990s) that 28% of black applicants are
denied mortgages, while only 9% of white applicants are denied. Do these data indicate
that, in practice, there is racial bias in mortgage lending? If so, how large is it?

The fact that more black than white applicants are denied in the Boston Fed data 
does not by itself provide evidence of discrimination by mortgage lenders because 
the black and white applicants differ in many ways other than their race. Before 
concluding that there is bias in the mortgage market, these data must be examined 
more closely to see if there is a difference in the probability of being denied for 
otherwise identical applicants and, if so, whether this difference is large or small. To 
do so, in Chapter 11 we introduce econometric methods that make it possible to 
quantify the effect of race on the chance of obtaining a mortgage, holding constant 
other applicant characteristics, notably their ability to repay the loan. 


Question #3: Does Healthcare Spending Improve 
Health Outcomes? 


It is self-evident that no one lives forever, but avoidable deaths can be reduced and 
survival can be extended through the provision of healthcare. Healthcare has other 
beneficial effects too, like the improvement of the health-related quality of life of indi- 
viduals. To these ends and more, a vast quantity of resources is devoted to the provision 
of healthcare worldwide. What is more, there is enormous variation in healthcare
expenditures across countries, both in absolute and per capita terms, as well as variation
in health outcomes across countries, measured, for example, by life expectancy at birth.

Putting aside concerns about iatrogenesis (the idea that healthcare is bad for your 
health), basic economics says that more expenditure on healthcare should generally 
reduce avoidable mortality. But by how much? If the amount spent on healthcare 
increases by 1%, by what percentage will avoidable mortality decrease? The percent- 
age change in avoidable mortality resulting from a 1% increase in healthcare expendi- 
ture is the spending elasticity for mortality (analogous to the price elasticity of demand, 
which is the percentage change in quantity demanded from a 1% increase in price). If 
we want to reduce avoidable mortality, say, 20% by increasing healthcare expenditure, 
then we need to know the spending elasticity for mortality to calculate the healthcare 
expenditure increase necessary to achieve this reduction in avoidable mortality. 
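
The arithmetic behind this calculation can be sketched in a few lines of Python. The elasticity value used here (-0.15) is purely hypothetical and is not an estimate from this text; the function name is likewise an assumption made for illustration. The sketch also treats the elasticity as constant over the whole change, which is only an approximation for changes as large as 20%.

```python
def required_spending_increase(target_mortality_change_pct, elasticity):
    """Percentage change in spending needed for a target percentage change in mortality.

    Uses the approximation:
        % change in mortality = elasticity * % change in spending
    """
    if elasticity == 0:
        raise ValueError("elasticity must be nonzero")
    return target_mortality_change_pct / elasticity


# Hypothetical elasticity (an assumption for illustration, not an estimate):
# a 1% rise in healthcare spending lowers avoidable mortality by 0.15%.
hypothetical_elasticity = -0.15

# Target from the text: reduce avoidable mortality by 20%.
increase = required_spending_increase(-20.0, hypothetical_elasticity)
print(f"Required spending increase: {increase:.1f}%")
```

With this made-up elasticity, the required increase is about 133%. A larger elasticity (in absolute value) would imply a much smaller increase, which is why the numerical value of the elasticity matters for policy.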

A number of policy objectives involve meeting targets for avoidable
mortality; for example, one of the United Nations Development Programme’s sustainable
development goals is that all countries should aim to reduce “under-5 mortality to
at least as low as 25 per 1,000 live births.”¹ But how should the goal be met: by
expanding healthcare services or other services? And if increasing healthcare spending
is to form part of the mix of policies, by how much will it need to increase? The answers
to these questions can be obtained with estimates of the spending elasticity for mortality.


¹United Nations Development Programme (UNDP), The Sustainable Development Goals (SDGs):
Goal 3: Good health and well-being, 2017.

While economic theory, such as the production function for health, helps us analyze
the mix of inputs that may lead to improved health outcomes, it does not tell us
the actual values for parameters such as the spending elasticity for mortality. To 
estimate the value, we must examine empirical evidence about the returns to health- 
care spending—either based on variations in spending across countries or within 
countries over time (or both). In other words, we need to analyze the data on how 
health outcomes and healthcare expenditures are related. 

For many years economists have attempted to address this question by consider- 
ing the data on healthcare expenditures and mortality rates across countries, but such 
empirical research is fraught with challenges. Two of the biggest challenges concern 
the extensive heterogeneity across countries. The first challenge is observable hetero- 
geneity, which concerns factors that affect countries’ mortality rates that may also be 
associated with healthcare expenditure, for example, the income per capita of each 
country. This can be controlled for using multiple regression analysis, as described in 
Part II, since these factors are observable to the analyst. The second and more troublesome
challenge is the presence of unobservable heterogeneity. Unobserved factors
may be important in the underlying processes that determine both how much
money is spent on healthcare and the overall level of
health outcomes attained. These factors result in causality running in both
directions—healthcare reduces mortality, but higher healthcare expenditure might 
be a response to unobserved factors, such as small natural disasters that increase 
mortality. Methods for handling this “simultaneous causality” are described in 
Chapter 12, applied to the different but conceptually similar context of estimating 
the price elasticity of cigarette demand. 


Question #4: By How Much Will U.S. GDP Grow 
Next Year? 


It seems that people always want a sneak preview of the future. What will sales be next 
year at a firm that is considering investing in new equipment? Will the stock market go 
up next month, and, if it does, by how much? Will city tax receipts next year cover 
planned expenditures on city services? Will your microeconomics exam next week 
focus on externalities or monopolies? Will Saturday be a nice day to go to the beach? 
One aspect of the future in which macroeconomists are particularly interested is the 
growth of real economic activity, as measured by real gross domestic product (GDP), 
during the next year. A management consulting firm might advise a manufacturing cli- 
ent to expand its capacity based on an upbeat forecast of economic growth. Economists 
at the Federal Reserve Board in Washington, D.C., are mandated to set policy to keep 
real GDP near its potential in order to maximize employment. If they forecast anemic 
GDP growth over the next year, they might expand liquidity in the economy by reduc- 
ing interest rates or other measures, in an attempt to boost economic activity. 
Professional economists who rely on numerical forecasts use econometric models
to make those forecasts. A forecaster’s job is to predict the future by using the
past, and econometricians do this by using economic theory and statistical techniques
to quantify relationships in historical data. 

The data we use to forecast the growth rate of GDP include past values of GDP 
and the so-called term spread in the United States. The term spread is the difference 
between long-term and short-term interest rates. It measures, among other things, 
whether investors expect short-term interest rates to rise or fall in the future. The 
term spread is usually positive, but it tends to fall sharply before the onset of a reces- 
sion. One of the GDP growth rate forecasts we develop and evaluate in Chapter 15 
is based on the term spread. 


Quantitative Questions, Quantitative Answers 


Each of these four questions requires a numerical answer. Economic theory provides 
clues about that answer—for example, cigarette consumption ought to go down when 
the price goes up—but the actual value of the number must be learned empirically, that 
is, by analyzing data. Because we use data to answer quantitative questions, our answers 
always have some uncertainty: A different set of data would produce a different numer- 
ical answer. Therefore, the conceptual framework for the analysis needs to provide both 
a numerical answer to the question and a measure of how precise the answer is. 

The conceptual framework used in this text is the multiple regression model, the 
mainstay of econometrics. This model, introduced in Part II, provides a mathematical 
way to quantify how a change in one variable affects another variable, holding other 
things constant. For example, what effect does a change in class size have on test 
scores, holding constant or controlling for student characteristics (such as family 
income) that a school district administrator cannot control? What effect does your 
race have on your chances of having a mortgage application granted, holding con- 
stant other factors such as your ability to repay the loan? What effect does a 1% 
increase in the price of cigarettes have on cigarette consumption, holding constant 
the income of smokers and potential smokers? The multiple regression model and 
its extensions provide a framework for answering these questions using data and for 
quantifying the uncertainty associated with those answers. 


Causal Effects and Idealized Experiments 


Like many other questions encountered in econometrics, the first three questions in 
Section 1.1 concern causal relationships among variables. In common usage, an action 
is said to cause an outcome if the outcome is the direct result, or consequence, of that 
action. Touching a hot stove causes you to get burned, drinking water causes you to 
be less thirsty, putting air in your tires causes them to inflate, putting fertilizer on your 
tomato plants causes them to produce more tomatoes. Causality means that a specific 
action (applying fertilizer) leads to a specific, measurable consequence (more 
tomatoes).


Estimation of Causal Effects


How best might we measure the causal effect on tomato yield (measured in kilo- 
grams) of applying a certain amount of fertilizer, say, 100 grams of fertilizer per 
square meter? 

One way to measure this causal effect is to conduct an experiment. In that exper- 
iment, a horticultural researcher plants many plots of tomatoes. Each plot is tended 
identically, with one exception: Some plots get 100 grams of fertilizer per square 
meter, while the rest get none. Whether or not a plot is fertilized is determined ran- 
domly by a computer, ensuring that any other differences between the plots are 
unrelated to whether they receive fertilizer. At the end of the growing season, the 
horticulturalist weighs the harvest from each plot. The difference between the aver- 
age yield per square meter of the treated and untreated plots is the effect on tomato 
production of the fertilizer treatment. 

This is an example of a randomized controlled experiment. It is controlled in the 
sense that there are both a control group that receives no treatment (no fertilizer) 
and a treatment group that receives the treatment (100 g/m² of fertilizer). It is randomized
in the sense that the treatment is assigned randomly. This random assignment
eliminates the possibility of a systematic relationship between, for example,
how sunny the plot is and whether it receives fertilizer so that the only systematic
difference between the treatment and control groups is the treatment. If this experiment
is properly implemented on a large enough scale, then it will yield an estimate
of the causal effect on the outcome of interest (tomato production) of the treatment
(applying 100 g/m² of fertilizer).
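
The role of random assignment can be illustrated with a small simulation. The sketch below is hypothetical throughout: the yield model, the 2 kg "sun bonus," the assumed true effect of 1 kg per square meter, and the assignment probabilities are all made-up numbers chosen for illustration, not results from any experiment.

```python
import random
import statistics


def simulate_plots(n_plots, true_effect, randomized, seed=0):
    """Simulate tomato yields (kg per square meter) and return the difference
    in mean yield between fertilized and unfertilized plots."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_plots):
        sunny = rng.random() < 0.5  # about half the plots are sunny
        if randomized:
            # Coin flip: assignment is unrelated to how sunny the plot is.
            fertilized = rng.random() < 0.5
        else:
            # Non-random assignment: sunny plots are more likely to be fertilized.
            fertilized = rng.random() < (0.8 if sunny else 0.2)
        # Hypothetical yield model: baseline + sun bonus + fertilizer effect + noise.
        yield_kg = 5.0 + rng.gauss(0.0, 1.0)
        if sunny:
            yield_kg += 2.0
        if fertilized:
            yield_kg += true_effect
        (treated if fertilized else control).append(yield_kg)
    return statistics.mean(treated) - statistics.mean(control)


TRUE_EFFECT = 1.0  # assumed causal effect of fertilizer (kg per square meter)

randomized_estimate = simulate_plots(20_000, TRUE_EFFECT, randomized=True)
confounded_estimate = simulate_plots(20_000, TRUE_EFFECT, randomized=False)

print(f"Randomized assignment: {randomized_estimate:.2f}")
print(f"Non-random assignment: {confounded_estimate:.2f}")
```

Under random assignment, the difference in means should land close to the assumed effect of 1.0. Under the non-random rule, it is pushed upward (toward roughly 2.2 in this setup) because fertilized plots are also disproportionately sunny, so the comparison mixes the effect of fertilizer with the effect of sunshine.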

In this text, the causal effect is defined to be the effect on an outcome of a given 
action or treatment, as measured in an ideal randomized controlled experiment. In 
such an experiment, the only systematic reason for differences in outcomes between 
the treatment and control groups is the treatment itself. 

It is possible to imagine an ideal randomized controlled experiment to answer each 
of the first three questions in Section 1.1. For example, to study class size, one can imag- 
ine randomly assigning “treatments” of different class sizes to different groups of stu- 
dents. If the experiment is designed and executed so that the only systematic difference 
between the groups of students is their class size, then in theory this experiment would 
estimate the effect on test scores of reducing class size, holding all else constant. 

Experiments are used increasingly widely in econometrics. In many applications, 
however, they are not an option because they are unethical, impossible to execute 
satisfactorily, too time-consuming, or prohibitively expensive. Even with non- 
experimental data, the concept of an ideal randomized controlled experiment is 
important because it provides a definition of a causal effect. 


Prediction, Forecasting, and Causality 


Although the first three questions in Section 1.1 concern causal effects, the fourth,
forecasting the growth rate of GDP, does not.

Forecasting is a special case of what statisticians and econometricians call
prediction, which is using information on some variables to make a statement about 
the value of another variable. A forecast is a prediction about the value of a variable 
in the future, like GDP growth next year. 

You do not need to know a causal relationship to make a good prediction. A 
good way to “predict” whether it is raining is to observe whether pedestrians are 
using umbrellas, but the act of using an umbrella does not cause it to rain. 

When one has a small number of predictors and the data do not evolve over time, 
the multiple regression methods of Part II can provide reliable predictions. Predic- 
tions can often be improved, however, if there is a large number of candidate predic- 
tors. Methods for using many predictors are covered in Chapter 14. 

Forecasts—that is, predictions about the future—use data on variables that 
evolve over time, which introduces new challenges and opportunities. As we will see 
in Chapter 15, multiple regression analysis allows us to quantify historical relation- 
ships, to check whether those relationships have been stable over time, to make quan- 
titative forecasts about the future, and to assess the accuracy of those forecasts. 


Data: Sources and Types 


In econometrics, data come from one of two sources: experiments or nonexperi- 
mental observations of the world. This text examines both experimental and 
nonexperimental data sets. 


Experimental versus Observational Data 


Experimental data come from experiments designed to evaluate a treatment or policy 
or to investigate a causal effect. For example, the state of Tennessee financed a large 
randomized controlled experiment examining class size in the 1980s. In that experiment, 
which we examine in Chapter 13, thousands of students were randomly assigned to 
classes of different sizes for several years and were given standardized tests annually. 

The Tennessee class size experiment cost millions of dollars and required the 
ongoing cooperation of many administrators, parents, and teachers over several years. 
Because real-world experiments with human subjects are difficult to administer and 
to control, they have flaws relative to ideal randomized controlled experiments. More- 
over, in some circumstances, experiments are not only expensive and difficult to 
administer but also unethical. (Would it be ethical to offer randomly selected teenag- 
ers inexpensive cigarettes to see how many they buy?) Because of these financial, 
practical, and ethical problems, experiments in economics are relatively rare. Instead, 
most economic data are obtained by observing real-world behavior. 

Data obtained by observing actual behavior outside an experimental setting are 
called observational data. Observational data are collected using surveys, such as 
telephone surveys of consumers, and administrative records, such as historical records 
on mortgage applications maintained by lending institutions.

Observational data pose major challenges to econometric attempts to estimate
causal effects, and the tools of econometrics are designed to tackle these challenges. 
In the real world, levels of “treatment” (the amount of fertilizer in the tomato exam- 
ple, the student-teacher ratio in the class size example) are not assigned at random, 
so it is difficult to sort out the effect of the “treatment” from other relevant factors. 
Much of econometrics, and much of this text, is devoted to methods for meeting the 
challenges encountered when real-world data are used to estimate causal effects. 

Whether the data are experimental or observational, data sets come in three 
main types: cross-sectional data, time series data, and panel data. In this text, you will 
encounter all three types. 


Cross-Sectional Data 


Data on different entities—workers, consumers, firms, governmental units, and so forth— 
for a single time period are called cross-sectional data. For example, the data on test scores 
in California school districts are cross sectional. Those data are for 420 entities (school 
districts) for a single time period (1999). In general, the number of entities on which we 
have observations is denoted n; so, for example, in the California data set, n = 420.

The California test score data set contains measurements of several different 
variables for each district. Some of these data are tabulated in Table 1.1. Each row 
lists data for a different district. For example, the average test score for the first district
(“district 1”) is 690.8; this is the average of the math and science test scores for
all fifth-graders in that district in 1999 on a standardized test (the Stanford Achievement
Test). The average student-teacher ratio in that district is 17.89; that is, the number
of students in district 1 divided by the number of classroom teachers in district 1
is 17.89. Average expenditure per pupil in district 1 is $6385. The percentage of students
in that district still learning English (that is, the percentage of students for
whom English is a second language and who are not yet proficient in English) is 0%.

TABLE 1.1  Selected Observations on Test Scores and Other Variables for California School
Districts in 1999

Observation   District Average   Student-Teacher   Expenditure     Percentage of Students
(District)    Test Score         Ratio             per Pupil ($)   Learning English
Number        (fifth grade)
1             690.8              17.89             $6385           0.0%
2             661.2              21.52             5099            4.6
3             643.6              18.70             5502            30.0
4             647.7              17.36             7102            0.0
5             640.8              18.67             5236            13.9
418           645.0              21.89             4403            24.3
419           672.2              20.20             4776            3.0
420           655.8              19.04             5993            5.0

Note: The California test score data set is described in Appendix 4.1.

The remaining rows present data for other districts. The order of the rows is 
arbitrary, and the number of the district, which is called the observation number, is 
an arbitrarily assigned number that organizes the data. As you can see in the table, 
all the variables listed vary considerably. 

With cross-sectional data, we can learn about relationships among variables by 
studying differences across people, firms, or other economic entities during a single 
time period. 


Time Series Data 


Time series data are data for a single entity (person, firm, country) collected at multiple 
time periods. Our data set on the growth rate of GDP and the term spread in the United 
States is an example of a time series data set. The data set contains observations on two 
variables (the growth rate of GDP and the term spread) for a single entity (the United 
States) for 232 time periods. Each time period in this data set is a quarter of a year (the 
first quarter is January, February, and March; the second quarter is April, May, and June; 
and so forth). The observations in this data set begin in the first quarter of 1960, which is 
denoted 1960:Q1, and end in the fourth quarter of 2017 (2017:Q4). The number of observations
(that is, time periods) in a time series data set is denoted T. Because there are 232
quarters from 1960:Q1 to 2017:Q4, this data set contains T = 232 observations.

Some observations in this data set are listed in Table 1.2. The data in each row
correspond to a different time period (year and quarter). In the first quarter of 1960,
for example, GDP grew 8.8% at an annual rate. In other words, if GDP had continued
growing for four quarters at its rate during the first quarter of 1960, the level of
GDP would have increased by 8.8%. In the first quarter of 1960, the long-term interest
rate was 4.5%, and the short-term interest rate was 3.9%; so their difference, the
term spread, was 0.6 percentage points.

TABLE 1.2  Selected Observations on the Growth Rate of GDP and the Term
Spread in the United States: Quarterly Data, 1960:Q1-2017:Q4

Observation   Date              GDP Growth Rate         Term Spread
Number        (year: quarter)   (% at an annual rate)   (percentage points)
1             1960:Q1           8.8%                    0.6
2             1960:Q2           -1.5                    1.3
3             1960:Q3           1.0                     1.5
4             1960:Q4           -4.9                    1.6
5             1961:Q1           2.7                     1.4
230           2017:Q2           3.0                     1.4
231           2017:Q3           3.1                     1.2
232           2017:Q4           2.5                     1.2

Note: The United States GDP and term spread data set is described in Appendix 15.1.

By tracking a single entity over time, time series data can be used to study the 
evolution of variables over time and to forecast future values of those variables. 
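
The numbers quoted in this section can be checked with a short Python sketch. The last step assumes simple compounding over four quarters, which matches the verbal description of an annual rate given above; the data set's own definition of the growth rate (see Appendix 15.1) may use a different convention, so treat that step as an illustration only.

```python
# Number of quarterly observations from 1960:Q1 through 2017:Q4, inclusive.
first_year, first_quarter = 1960, 1
last_year, last_quarter = 2017, 4
T = (last_year - first_year) * 4 + (last_quarter - first_quarter) + 1
print(T)  # 232

# Term spread in 1960:Q1: long-term rate minus short-term rate (percentage points).
long_rate, short_rate = 4.5, 3.9
term_spread = round(long_rate - short_rate, 1)
print(term_spread)  # 0.6

# Assumption: "8.8% at an annual rate" means the quarterly growth rate that,
# compounded over four quarters, raises the level of GDP by 8.8%.
annual_rate_pct = 8.8
quarterly_rate_pct = ((1 + annual_rate_pct / 100) ** 0.25 - 1) * 100
print(f"{quarterly_rate_pct:.2f}")  # about 2.13
```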


Panel Data 


Panel data, also called longitudinal data, are data for multiple entities in which each
entity is observed at two or more time periods. Our data on cigarette consumption and
prices are an example of a panel data set, and selected variables and observations in that
data set are listed in Table 1.3. The number of entities in a panel data set is denoted n,
and the number of time periods is denoted T. In the cigarette data set, we have observations
on n = 48 continental U.S. states (entities) for T = 11 years (time periods) from
1985 to 1995. Thus, there is a total of n × T = 48 × 11 = 528 observations.

Some data from the cigarette consumption data set are listed in Table 1.3. The
first block of 48 observations lists the data for each state in 1985, organized alphabetically
from Alabama to Wyoming. The next block of 48 observations lists the data for
1986, and so forth, through 1995. For example, in 1985, cigarette sales in Arkansas
were 128.5 packs per capita (the total number of packs of cigarettes sold in Arkansas
in 1985 divided by the total population of Arkansas in 1985 equals 128.5). The average
price of a pack of cigarettes in Arkansas in 1985, including tax, was $1.015, of
which 37¢ went to federal, state, and local taxes.

TABLE 1.3  Selected Observations on Cigarette Sales, Prices, and Taxes, by State and Year for
U.S. States, 1985-1995

Observation                            Cigarette Sales      Average Price per Pack   Total Taxes (cigarette
Number        State           Year     (packs per capita)   (including taxes)        excise tax + sales tax)
1             Alabama         1985     116.5                $1.022                   $0.333
2             Arkansas        1985     128.5                1.015                    0.370
3             Arizona         1985     104.5                1.086                    0.362
47            West Virginia   1985     112.8                1.089                    0.382
48            Wyoming         1985     129.4                0.935                    0.240
49            Alabama         1986     117.2                1.080                    0.334
96            Wyoming         1986     127.8                1.007                    0.240
97            Alabama         1987     115.8                1.135                    0.335
528           Wyoming         1995     112.2                1.585                    0.360

Note: The cigarette consumption data set is described in Appendix 12.1.

KEY CONCEPT 1.1: Cross-Sectional, Time Series, and Panel Data

• Cross-sectional data consist of multiple entities observed at a single time period.
• Time series data consist of a single entity observed at multiple time periods.
• Panel data (also known as longitudinal data) consist of multiple entities,
  where each entity is observed at two or more time periods.

Panel data can be used to learn about economic relationships from the experi- 
ences of the many different entities in the data set and from the evolution over time 
of the variables for each entity. 

The definitions of cross-sectional data, time series data, and panel data are sum- 
marized in Key Concept 1.1. 
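
To make the three definitions concrete, here is one possible way to hold each type of data in plain Python. The variable and field names are illustrative assumptions; the numbers are copied from rows of Tables 1.1, 1.2, and 1.3.

```python
# Cross-sectional data: many entities, one time period (rows from Table 1.1).
cross_section = [
    {"district": 1, "test_score": 690.8, "student_teacher_ratio": 17.89},
    {"district": 2, "test_score": 661.2, "student_teacher_ratio": 21.52},
    {"district": 3, "test_score": 643.6, "student_teacher_ratio": 18.70},
]

# Time series data: one entity, many time periods (rows from Table 1.2).
time_series = [
    {"date": "1960:Q1", "gdp_growth": 8.8, "term_spread": 0.6},
    {"date": "1960:Q2", "gdp_growth": -1.5, "term_spread": 1.3},
    {"date": "1960:Q4", "gdp_growth": -4.9, "term_spread": 1.6},
]

# Panel data: many entities, each observed in several periods (rows from
# Table 1.3), keyed by (entity, time period).
panel = {
    ("Alabama", 1985): {"packs_per_capita": 116.5, "price": 1.022, "taxes": 0.333},
    ("Alabama", 1987): {"packs_per_capita": 115.8, "price": 1.135, "taxes": 0.335},
    ("Wyoming", 1985): {"packs_per_capita": 129.4, "price": 0.935, "taxes": 0.240},
    ("Wyoming", 1995): {"packs_per_capita": 112.2, "price": 1.585, "taxes": 0.360},
}

# Fixing the entity in a panel gives a time series for that entity ...
alabama_series = {year: row for (state, year), row in panel.items() if state == "Alabama"}

# ... and fixing the time period gives a cross section for that period.
cross_1985 = {state: row for (state, year), row in panel.items() if year == 1985}

# Size of the full cigarette panel described in the text: n entities, T periods.
n_states, n_years = 48, 11
total_observations = n_states * n_years

print(sorted(alabama_series))  # [1985, 1987]
print(sorted(cross_1985))      # ['Alabama', 'Wyoming']
print(total_observations)      # 528
```

Keying the panel by (entity, time period) is only one design choice; it makes the point that each panel observation needs two indexes, whereas a cross section or a time series needs only one.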


Summary 


1. Many decisions in business and economics require quantitative estimates of 
how a change in one variable affects another variable. 

2. Conceptually, the way to estimate a causal effect is in an ideal randomized 
controlled experiment, but performing experiments in economic applications 
can be unethical, impractical, or too expensive. 

3. Econometrics provides tools for estimating causal effects using either observa- 
tional (nonexperimental) data or data from real-world, imperfect experiments. 

4. Econometrics also provides tools for predicting the value of a variable of 
interest using information in other, related variables. 

5. Cross-sectional data are gathered by observing multiple entities at a single 
point in time; time series data are gathered by observing a single entity at mul- 
tiple points in time; and panel data are gathered by observing multiple entities, 
each of which is observed at multiple points in time. 


Key Terms 


randomized controlled experiment
control group
treatment group
causal effect
prediction
forecast
experimental data
observational data
cross-sectional data
observation number
time series data
panel data
longitudinal data


Review the Concepts


1.1 Describe a hypothetical ideal randomized controlled experiment to study the 
effect of six hours of reading on the improvement of the vocabulary of high 
school students. Suggest some impediments to implementing this experiment 
in practice. 


1.2 Describe a hypothetical ideal randomized controlled experiment to study the 
effect of the consumption of alcohol on long-term memory loss. Suggest some 
impediments to implementing this experiment in practice. 


1.3 You are asked to study the causal effect of hours spent on employee training 
(measured in hours per worker per week) in a manufacturing plant on the 
productivity of its workers (output per worker per hour). Describe: 

a. an ideal randomized controlled experiment to measure this causal effect; 


b. an observational cross-sectional data set with which you could study this 
effect; 


c. an observational time series data set for studying this effect; and 


d. an observational panel data set for studying this effect. 


Review of Probability 


This chapter reviews the core ideas of the theory of probability that are needed to
understand regression analysis and econometrics. We assume that you have taken 
an introductory course in probability and statistics. If your knowledge of probability is 
stale, you should refresh it by reading this chapter. If you feel confident with the mate- 
rial, you still should skim the chapter and the terms and concepts at the end to make 
sure you are familiar with the ideas and notation. 

Most aspects of the world around us have an element of randomness. The theory 
of probability provides mathematical tools for quantifying and describing this random- 
ness. Section 2.1 reviews probability distributions for a single random variable, and 
Section 2.2 covers the mathematical expectation, mean, and variance of a single ran- 
dom variable. Most of the interesting problems in economics involve more than one 
variable, and Section 2.3 introduces the basic elements of probability theory for two 
random variables. Section 2.4 discusses three special probability distributions that 
play a central role in statistics and econometrics: the normal, chi-squared, and 
F distributions. 

The final two sections of this chapter focus on a specific source of randomness of 
central importance in econometrics: the randomness that arises by randomly drawing 
a sample of data from a larger population. For example, suppose you survey ten recent 
college graduates selected at random, record (or “observe”) their earnings, and com- 
pute the average earnings using these ten data points (or “observations”). Because you 
chose the sample at random, you could have chosen ten different graduates by pure 
random chance; had you done so, you would have observed ten different earnings, 
and you would have computed a different sample average. Because the average earn- 
ings vary from one randomly chosen sample to the next, the sample average is itself a 
random variable. Therefore, the sample average has a probability distribution, which is 
referred to as its sampling distribution because this distribution describes the different 
possible values of the sample average that would have occurred had a different sample 
been drawn. 

Section 2.5 discusses random sampling and the sampling distribution of the sam- 
ple average. This sampling distribution is, in general, complicated. When the sample 
size is sufficiently large, however, the sampling distribution of the sample average is 
approximately normal, a result known as the central limit theorem, which is discussed 
in Section 2.6. 

2.1 Random Variables and Probability Distributions 


Probabilities, the Sample Space, and Random Variables 


Probabilities and outcomes. The sex of the next new person you meet, your grade 
on an exam, and the number of times your wireless network connection fails while 
you are writing a term paper all have an element of chance or randomness. In each 
of these examples, there is something not yet known that is eventually revealed. 

The mutually exclusive potential results of a random process are called the 
outcomes. For example, while writing your term paper, the wireless connection might 
never fail, it might fail once, it might fail twice, and so on. Only one of these outcomes 
will actually occur (the outcomes are mutually exclusive), and the outcomes need not 
be equally likely. 

The probability of an outcome is the proportion of the time that the outcome 
occurs in the long run. If the probability of your wireless connection not failing while 
you are writing a term paper is 80%, then over the course of writing many term 
papers, you will complete 80% without a wireless connection failure. 


The sample space and events. The set of all possible outcomes is called the sample 
space. An event is a subset of the sample space; that is, an event is a set of one or more 
outcomes. The event “my wireless connection will fail no more than once” is the set 
consisting of two outcomes: “no failures” and “one failure.” 


Random variables. A random variable is a numerical summary of a random out- 
come. The number of times your wireless connection fails while you are writing a 
term paper is random and takes on a numerical value, so it is a random variable. 
Some random variables are discrete and some are continuous. As their names suggest, 
a discrete random variable takes on only a discrete set of values, like 0, 1, 2, ..., 
whereas a continuous random variable takes on a continuum of possible values. 


Probability Distribution of a Discrete Random Variable 


Probability distribution. The probability distribution of a discrete random variable 
is the list of all possible values of the variable and the probability that each value will 
occur. These probabilities sum to 1. 

For example, let M be the number of times your wireless network connection 
fails while you are writing a term paper. The probability distribution of the random 
variable M is the list of probabilities of all possible outcomes: The probability that 
M = 0, denoted Pr(M = 0), is the probability of no wireless connection failures; 
Pr(M = 1) is the probability of a single connection failure; and so forth. An example 
of a probability distribution for M is given in the first row of Table 2.1. According to 
this distribution, the probability of no connection failures is 80%; the probability of 
one failure is 10%; and the probabilities of two, three, and four failures are, 
respectively, 6%, 3%, and 1%. These probabilities sum to 100%. This probability 
distribution is plotted in Figure 2.1. 

TABLE 2.1 Probability of Your Wireless Network Connection Failing M Times 

Outcome (number of failures)          0     1     2     3     4 
Probability distribution              0.80  0.10  0.06  0.03  0.01 
Cumulative probability distribution   0.80  0.90  0.96  0.99  1.00 


Probabilities of events. The probability of an event can be computed from the prob- 
ability distribution. For example, the probability of the event of one or two failures 
is the sum of the probabilities of the constituent outcomes. That is, 
Pr(M = 1 or M = 2) = Pr(M = 1) + Pr(M = 2) = 0.10 + 0.06 = 0.16, or 16%. 


Cumulative probability distribution. The cumulative probability distribution is the 
probability that the random variable is less than or equal to a particular value. The final 
row of Table 2.1 gives the cumulative probability distribution of the random variable M. 
For example, the probability of at most one connection failure, Pr(M ≤ 1), is 90%, 
which is the sum of the probabilities of no failures (80%) and of one failure (10%). 

A cumulative probability distribution is also referred to as a cumulative distribu- 
tion function, a c.d.f., or a cumulative distribution. 
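
The calculations above can be sketched in a few lines of Python. This is only an illustration using the probabilities in Table 2.1; the names `pmf`, `prob_event`, and `cdf` are our own choices and are not part of the text.

```python
from itertools import accumulate

# Probability distribution of M (number of connection failures), from Table 2.1.
pmf = {0: 0.80, 1: 0.10, 2: 0.06, 3: 0.03, 4: 0.01}

# The probabilities of all possible outcomes must sum to 1.
total = sum(pmf.values())


def prob_event(outcomes):
    """Probability of an event, computed as the sum over its constituent outcomes."""
    return sum(pmf[m] for m in outcomes)


# Cumulative distribution: Pr(M <= m) is the running sum of the probabilities.
values = sorted(pmf)
cdf = dict(zip(values, accumulate(pmf[m] for m in values)))

print(round(total, 10))
print(round(prob_event({1, 2}), 10))  # Pr(M = 1 or M = 2)
print({m: round(c, 10) for m, c in cdf.items()})
```

Rounding is applied only when printing, to hide floating-point noise in the running sums.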


FIGURE 2.1 Probability Distribution of the Number of Wireless Network Connection Failures 

The height of each bar is the probability that the wireless connection fails the indicated 
number of times. The height of the first bar is 0.8, so the probability of 0 connection 
failures is 80%. The height of the second bar is 0.1, so the probability of 1 failure is 
10%, and so forth for the other bars. 
[Bar chart. Vertical axis: Probability. Horizontal axis: Number of failures (0 to 4).] 




The Bernoulli distribution. An important special case of a discrete random variable 
is when the random variable is binary; that is, the outcome is 0 or 1. A binary random 
variable is called a Bernoulli random variable (in honor of the 17th-century Swiss 
mathematician and scientist Jacob Bernoulli), and its probability distribution is 
called the Bernoulli distribution. 

For example, let G be the sex of the next new person you meet, where G = 0 
indicates that the person is male and G = 1 indicates that the person is female. The 
outcomes of G and their probabilities thus are 

\[ G = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p, \end{cases} \tag{2.1} \] 


where p is the probability of the next new person you meet being a woman. The prob- 
ability distribution in Equation (2.1) is the Bernoulli distribution. 


Probability Distribution of a Continuous 
Random Variable 


Cumulative probability distribution. The cumulative probability distribution for a 
continuous variable is defined just as it is for a discrete random variable. That is, the 
cumulative probability distribution of a continuous random variable is the probabil- 
ity that the random variable is less than or equal to a particular value. 

For example, consider a student who drives from home to school. This student’s 
commuting time can take on a continuum of values, and because it depends on ran- 
dom factors such as the weather and traffic conditions, it is natural to treat it as a 
continuous random variable. Figure 2.2a plots a hypothetical cumulative distribution 
of commuting times. For example, the probability that the commute takes less than 
15 minutes is 20%, and the probability that it takes less than 20 minutes is 78%. 


Probability density function. Because a continuous random variable can take on a con- 
tinuum of possible values, the probability distribution used for discrete variables, which 
lists the probability of each possible value of the random variable, is not suitable for 
continuous variables. Instead, the probability is summarized by the probability density 
function. The area under the probability density function between any two points is the 
probability that the random variable falls between those two points. A probability 
density function is also called a p.d.f., a density function, or simply a density. 

Figure 2.2b plots the probability density function of commuting times corre- 
sponding to the cumulative distribution in Figure 2.2a. The probability that the com- 
mute takes between 15 and 20 minutes is given by the area under the p.d.f. between 
15 minutes and 20 minutes, which is 0.58, or 58%. Equivalently, this probability can 
be seen on the cumulative distribution in Figure 2.2a as the difference between the 
probability that the commute is less than 20 minutes (78%) and the probability that 
it is less than 15 minutes (20%). Thus the probability density function and the 
cumulative probability distribution show the same information in different formats. 
[Figure 2.2. Panel (b): Probability density function of commuting times] 

Figure 2.2a shows the cumulative probability distribution function (c.d.f.) of commuting 
times. The probability that a commuting time is less than 15 minutes is 0.20 (or 20%), and 
the probability that it is less than 20 minutes is 0.78 (78%). Figure 2.2b shows the 
probability density function (or p.d.f.) of commuting times. Probabilities are given by 
areas under the p.d.f. The probability that a commuting time is between 15 and 20 minutes 
is 0.58 (58%) and is given by the area under the curve between 15 and 20 minutes. 
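
The equivalence between an area under the p.d.f. and a difference of c.d.f. values can be checked numerically. The distribution behind Figure 2.2 is not given, so the sketch below substitutes a normal distribution with a hypothetical mean of 18 minutes and standard deviation of 3 minutes; those two numbers, and the function names, are assumptions made only for illustration and will not reproduce the 58% in the figure.

```python
import math

# Hypothetical parameters; they are NOT taken from Figure 2.2.
MEAN, SD = 18.0, 3.0


def pdf(x):
    """Normal probability density function with the assumed mean and sd."""
    z = (x - MEAN) / SD
    return math.exp(-0.5 * z * z) / (SD * math.sqrt(2.0 * math.pi))


def cdf(x):
    """Normal cumulative distribution function, Pr(commute <= x)."""
    return 0.5 * (1.0 + math.erf((x - MEAN) / (SD * math.sqrt(2.0))))


def area_under_pdf(a, b, n=10_000):
    """Trapezoidal approximation of the area under the p.d.f. between a and b."""
    h = (b - a) / n
    inner = sum(pdf(a + i * h) for i in range(1, n))
    return h * (0.5 * pdf(a) + inner + 0.5 * pdf(b))


area = area_under_pdf(15.0, 20.0)       # area under the p.d.f.
difference = cdf(20.0) - cdf(15.0)      # difference of c.d.f. values
print(area, difference)
```

The two printed numbers agree up to the error of the numerical integration, which is the sense in which the p.d.f. and the c.d.f. carry the same information.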
2.2 Expected Values, Mean, and Variance 


The Expected Value of a Random Variable 


Expected value. The expected value of a random variable Y, denoted E(Y), is the 
long-run average value of the random variable over many repeated trials or occur- 
rences. The expected value of a discrete random variable is computed as a weighted 
average of the possible outcomes of that random variable, where the weights are the 
probabilities of that outcome. The expected value of Y is also called the expectation 
of Y or the mean of Y and is denoted \(\mu_Y\). 

For example, suppose you loan a friend $100 at 10% interest. If the loan is repaid, 
you get $110 (the principal of $100 plus interest of $10), but there is a risk of 1% that 
your friend will default and you will get nothing at all. Thus the amount you are 
repaid is a random variable that equals $110 with probability 0.99 and equals $0 with 
probability 0.01. Over many such loans, 99% of the time you would be paid back 
$110, but 1% of the time you would get nothing, so on average you would be repaid 
$110 × 0.99 + $0 × 0.01 = $108.90. Thus the expected value of your repayment is 
$108.90. 

As a second example, consider the number of wireless network connection failures 
M with the probability distribution given in Table 2.1. The expected value of M—that 
is, the mean of M—is the average number of failures over many term papers, weighted 
by the frequency with which a given number of failures occurs. Accordingly, 


\[ E(M) = 0 \times 0.80 + 1 \times 0.10 + 2 \times 0.06 + 3 \times 0.03 + 4 \times 0.01 = 0.35. \tag{2.2} \] 


That is, the expected number of connection failures while writing a term paper is 0.35. 
Of course, the actual number of failures must always be an integer; it makes no sense 
to say that the wireless connection failed 0.35 times while writing a particular term 
paper! Rather, the calculation in Equation (2.2) means that the average number of 
failures over many such term papers is 0.35. 

The formula for the expected value of a discrete random variable Y that can take 
on k different values is given in Key Concept 2.1. (Key Concept 2.1 uses summation 
notation, which is reviewed in Exercise 2.25.) 


KEY CONCEPT 2.1  Expected Value and the Mean 

Suppose that the random variable Y takes on k possible values, \(y_1, \ldots, y_k\), where 
\(y_1\) denotes the first value, \(y_2\) denotes the second value, and so forth, and that the 
probability that Y takes on \(y_1\) is \(p_1\), the probability that Y takes on \(y_2\) is 
\(p_2\), and so forth. The expected value of Y, denoted E(Y), is 

\[ E(Y) = y_1 p_1 + y_2 p_2 + \cdots + y_k p_k = \sum_{i=1}^{k} y_i p_i, \tag{2.3} \] 

where the notation \(\sum_{i=1}^{k} y_i p_i\) means “the sum of \(y_i p_i\) for i running 
from 1 to k.” The expected value of Y is also called the mean of Y or the expectation of Y 
and is denoted \(\mu_Y\). 
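
As an illustration of the formula in Key Concept 2.1, the following Python sketch computes the expected value of the two examples above. The function name `expected_value` and the dictionary layout (value mapped to probability) are our own choices, not notation from the text.

```python
def expected_value(dist):
    """Probability-weighted average of the outcomes.

    `dist` maps each possible value y_i to its probability p_i.
    """
    return sum(y * p for y, p in dist.items())


# Loan example: repaid $110 with probability 0.99, $0 with probability 0.01.
loan_repayment = {110.0: 0.99, 0.0: 0.01}

# Number of connection failures M, from Table 2.1.
failures = {0: 0.80, 1: 0.10, 2: 0.06, 3: 0.03, 4: 0.01}

print(round(expected_value(loan_repayment), 2))
print(round(expected_value(failures), 2))
```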
Expected value of a Bernoulli random variable. An important special case of the 
general formula in Key Concept 2.1 is the mean of a Bernoulli random variable. 
Let G be the Bernoulli random variable with the probability distribution in 
Equation (2.1). The expected value of G is 


\[ E(G) = 0 \times (1 - p) + 1 \times p = p. \tag{2.4} \] 


Thus the expected value of a Bernoulli random variable is p, the probability that it 
takes on the value 1. 


Expected value of a continuous random variable. The expected value of a continu- 
ous random variable is also the probability-weighted average of the possible out- 
comes of the random variable. Because a continuous random variable can take on a 
continuum of possible values, the formal mathematical definition of its expectation 
involves calculus and its definition is given in Appendix 18.1. 


The Standard Deviation and Variance 


The variance and standard deviation measure the dispersion or the “spread” of a 
probability distribution. The variance of a random variable Y, denoted var(Y), is the 
expected value of the square of the deviation of Y from its mean: 
\(\operatorname{var}(Y) = E[(Y - \mu_Y)^2]\). 

Because the variance involves the square of Y, the units of the variance are the 
units of the square of Y, which makes the variance awkward to interpret. It is therefore 
common to measure the spread by the standard deviation, which is the square 
root of the variance and is denoted \(\sigma_Y\). The standard deviation has the same units 
as Y. These definitions are summarized in Key Concept 2.2. 

For example, the variance of the number of connection failures M is the 
probability-weighted average of the squared difference between M and its mean, 0.35: 


\[ \operatorname{var}(M) = (0 - 0.35)^2 \times 0.80 + (1 - 0.35)^2 \times 0.10 + (2 - 0.35)^2 \times 0.06 
+ (3 - 0.35)^2 \times 0.03 + (4 - 0.35)^2 \times 0.01 = 0.6475. \tag{2.5} \] 


The standard deviation of M is the square root of the variance, so 
\(\sigma_M = \sqrt{0.6475} \cong 0.80\). 


KEY CONCEPT 2.2  Variance and Standard Deviation 

The variance of the discrete random variable Y, denoted \(\sigma_Y^2\), is 

\[ \sigma_Y^2 = \operatorname{var}(Y) = E[(Y - \mu_Y)^2] = \sum_{i=1}^{k} (y_i - \mu_Y)^2 p_i. \tag{2.6} \] 

The standard deviation of Y is \(\sigma_Y\), the square root of the variance. The units of the 
standard deviation are the same as the units of Y. 
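
The definitions in Key Concept 2.2 translate directly into code. The sketch below applies them to the distribution of connection failures in Table 2.1; the helper names `mean` and `variance` are illustrative choices of ours.

```python
import math


def mean(dist):
    """E(Y) for a discrete distribution given as {value: probability}."""
    return sum(y * p for y, p in dist.items())


def variance(dist):
    """var(Y) = E[(Y - mu_Y)^2], the probability-weighted squared deviation."""
    mu = mean(dist)
    return sum((y - mu) ** 2 * p for y, p in dist.items())


# Number of connection failures M, from Table 2.1.
failures = {0: 0.80, 1: 0.10, 2: 0.06, 3: 0.03, 4: 0.01}

var_m = variance(failures)
sd_m = math.sqrt(var_m)  # standard deviation: same units as M

print(round(var_m, 4), round(sd_m, 2))
```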
Variance of a Bernoulli random variable. The mean of the Bernoulli random variable 
G with the probability distribution in Equation (2.1) is \(\mu_G = p\) [Equation (2.4)], 
so its variance is 


\[ \operatorname{var}(G) = \sigma_G^2 = (0 - p)^2 \times (1 - p) + (1 - p)^2 \times p = p(1 - p). \tag{2.7} \] 


Thus the standard deviation of a Bernoulli random variable is \(\sigma_G = \sqrt{p(1 - p)}\). 
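
The last equality in the Bernoulli variance formula skips a few algebraic steps. One way to fill them in, using only the probabilities p and 1 − p from Equation (2.1), is to factor p(1 − p) out of both terms:

```latex
\[
\begin{aligned}
\operatorname{var}(G) &= (0 - p)^2 (1 - p) + (1 - p)^2 p \\
                      &= p^2 (1 - p) + p (1 - p)^2 \\
                      &= p (1 - p) \, [\, p + (1 - p) \,] \\
                      &= p (1 - p).
\end{aligned}
\]
```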


Mean and Variance of a Linear Function 
of a Random Variable 


This section discusses random variables (say, X and Y) that are related by a linear func- 
tion. For example, consider an income tax scheme under which a worker is taxed at a rate 
of 20% on his or her earnings and then given a (tax-free) grant of $2000. Under this tax 
scheme, after-tax earnings Y are related to pre-tax earnings X by the equation 


Y = 2000 + 0.8X. (2.8) 


That is, after-tax earnings Y is 80% of pre-tax earnings X, plus $2000. 

Suppose an individual’s pre-tax earnings next year are a random variable with 
mean \(\mu_X\) and variance \(\sigma_X^2\). Because pre-tax earnings are random, so are 
after-tax earnings. What are the mean and standard deviation of her after-tax earnings 
under this tax? After taxes, her earnings are 80% of the original pre-tax earnings, plus 
$2000. Thus the expected value of her after-tax earnings is 


\[ E(Y) = \mu_Y = 2000 + 0.8\mu_X. \tag{2.9} \] 


The variance of after-tax earnings is the expected value of \((Y - \mu_Y)^2\). Because 
Y = 2000 + 0.8X, \(Y - \mu_Y = 2000 + 0.8X - (2000 + 0.8\mu_X) = 0.8(X - \mu_X)\). 
Thus \(E[(Y - \mu_Y)^2] = E\{[0.8(X - \mu_X)]^2\} = 0.64E[(X - \mu_X)^2]\). It follows that 
var(Y) = 0.64 var(X), so, taking the square root of the variance, the standard deviation 
of Y is 

\[ \sigma_Y = 0.8\sigma_X. \tag{2.10} \] 


That is, the standard deviation of the distribution of her after-tax earnings is 80% of 
the standard deviation of the distribution of her pre-tax earnings. 

This analysis can be generalized so that Y depends on X with an intercept a 
(instead of $2000) and a slope b (instead of 0.8) so that 


Y = a + bX. (2.11) 

Then the mean and variance of Y are 

\[ \mu_Y = a + b\mu_X \quad \text{and} \tag{2.12} \] 
\[ \sigma_Y^2 = b^2\sigma_X^2, \tag{2.13} \] 


and the standard deviation of Y is \(\sigma_Y = b\sigma_X\) (for b ≥ 0; in general, 
\(\sigma_Y = |b|\sigma_X\)). The expressions in Equations (2.9) and (2.10) are applications 
of the more general formulas in Equations (2.12) and (2.13) with a = 2000 and b = 0.8. 
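
These formulas can be verified exactly for any discrete distribution of X. The Python sketch below uses a made-up three-point distribution of pre-tax earnings (the dollar amounts and probabilities are hypothetical, chosen only for illustration) and the tax scheme Y = 2000 + 0.8X.

```python
import math

# Hypothetical distribution of pre-tax earnings X: {dollars: probability}.
pre_tax = {20_000: 0.25, 40_000: 0.50, 80_000: 0.25}


def mean(dist):
    return sum(v * p for v, p in dist.items())


def sd(dist):
    mu = mean(dist)
    return math.sqrt(sum((v - mu) ** 2 * p for v, p in dist.items()))


a, b = 2000, 0.8  # intercept and slope of the tax scheme

# Y = a + bX takes the value a + b*x with the same probability that X takes x.
after_tax = {a + b * x: p for x, p in pre_tax.items()}

print(mean(after_tax), a + b * mean(pre_tax))  # the two means agree
print(sd(after_tax), b * sd(pre_tax))          # the two standard deviations agree
```

Transforming the values while keeping the probabilities is valid here because the linear map is one to one when b is not 0.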
Other Measures of the Shape of a Distribution 


The mean and standard deviation measure two important features of a distribution: 
its center (the mean) and its spread (the standard deviation). This section discusses 
measures of two other features of a distribution: the skewness, which measures the 
lack of symmetry of a distribution, and the kurtosis, which measures how thick, or 
“heavy,” are its tails. The mean, variance, skewness, and kurtosis are all based on what 
are called the moments of a distribution. 


Skewness. Figure 2.3 plots four distributions, two that are symmetric (Figures 2.3a 
and 2.3b) and two that are not (Figures 2.3c and 2.3d). Visually, the distribution in 
Figure 2.3d appears to deviate more from symmetry than does the distribution in 
Figure 2.3c. The skewness of a distribution provides a mathematical way to describe 
how much a distribution deviates from symmetry. 

[Figure 2.3. Panel titles: (c) Skewness = −0.1, kurtosis = 5; (d) Skewness = 0.6, kurtosis = 5] 

All of these distributions have a mean of 0 and a variance of 1. The distributions with 
skewness of 0 (a and b) are symmetric; the distributions with nonzero skewness (c and d) 
are not symmetric. The distributions with kurtosis exceeding 3 (b, c, and d) have heavy 
tails. 

The skewness of the distribution of a random variable Y is 

\[ \text{Skewness} = \frac{E[(Y - \mu_Y)^3]}{\sigma_Y^3}, \tag{2.14} \] 

where \(\sigma_Y\) is the standard deviation of Y. For a symmetric distribution, a value of Y 
a given amount above its mean is just as likely as a value of Y the same amount below 
its mean. If so, then positive values of \((Y - \mu_Y)^3\) will be offset on average (in 
expectation) by equally likely negative values. Thus, for a symmetric distribution, 
\(E[(Y - \mu_Y)^3] = 0\): The skewness of a symmetric distribution is 0. If a distribution is 
not symmetric, then a positive value of \((Y - \mu_Y)^3\) generally is not offset on average 
by an equally likely negative value, so the skewness is nonzero for a distribution that 
is not symmetric. Dividing by \(\sigma_Y^3\) in the denominator of Equation (2.14) cancels the 
units of \(Y^3\) in the numerator, so the skewness is unit free; in other words, changing 
the units of Y does not change its skewness. 

Below each of the four distributions in Figure 2.3 is its skewness. If a distribution has 
a long right tail, positive values of \((Y - \mu_Y)^3\) are not fully offset by negative values, 
and the skewness is positive. If a distribution has a long left tail, its skewness is negative. 


Kurtosis. The kurtosis of a distribution is a measure of how much mass is in its tails 
and therefore is a measure of how much of the variance of Y arises from extreme 
values. An extreme value of Y is called an outlier. The greater the kurtosis of a dis- 
tribution, the more likely are outliers. 

The kurtosis of the distribution of Y is 


\[ \text{Kurtosis} = \frac{E[(Y - \mu_Y)^4]}{\sigma_Y^4}. \tag{2.15} \] 


If a distribution has a large amount of mass in its tails, then some extreme departures 
of Y from its mean are likely, and these departures will lead to large values, on average 
(in expectation), of \((Y - \mu_Y)^4\). Thus, for a distribution with a large amount of 
mass in its tails, the kurtosis will be large. Because \((Y - \mu_Y)^4\) cannot be negative, 
the kurtosis cannot be negative. 

The kurtosis of a normally distributed random variable is 3, so a random variable 
with kurtosis exceeding 3 has more mass in its tails than a normal random variable. 
A distribution with kurtosis exceeding 3 is called leptokurtic or, more simply, heavy- 
tailed. Like skewness, the kurtosis is unit free, so changing the units of Y does not 
change its kurtosis. 

Below each of the four distributions in Figure 2.3 is its kurtosis. The distributions 
in Figures 2.3b-d are heavy-tailed. 
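
For a discrete distribution, Equations (2.14) and (2.15) can be evaluated directly. The sketch below does so for the connection-failure distribution of Table 2.1 and for a small symmetric distribution that we made up for comparison; the function names and the symmetric example are assumptions for illustration only.

```python
def central_moment(dist, r):
    """E[(Y - mu_Y)^r] for a discrete distribution given as {value: probability}."""
    mu = sum(y * p for y, p in dist.items())
    return sum((y - mu) ** r * p for y, p in dist.items())


def skewness(dist):
    """Third central moment divided by the cube of the standard deviation."""
    return central_moment(dist, 3) / central_moment(dist, 2) ** 1.5


def kurtosis(dist):
    """Fourth central moment divided by the fourth power of the standard deviation."""
    return central_moment(dist, 4) / central_moment(dist, 2) ** 2


# Number of connection failures M, from Table 2.1 (long right tail).
failures = {0: 0.80, 1: 0.10, 2: 0.06, 3: 0.03, 4: 0.01}

# Hypothetical symmetric distribution, for comparison.
symmetric = {-1: 0.25, 0: 0.50, 1: 0.25}

# Changing the units of M (multiplying by 100) should leave both measures unchanged.
failures_rescaled = {100 * m: p for m, p in failures.items()}

print(skewness(failures), kurtosis(failures))
print(skewness(symmetric), kurtosis(symmetric))
print(skewness(failures_rescaled), kurtosis(failures_rescaled))
```

The distribution of M has a long right tail, so its skewness comes out positive, while the symmetric example has skewness 0; rescaling M leaves both measures unchanged, as the unit-free property requires.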


Moments. The mean of Y, E(Y), is also called the first moment of Y, and the expected 
value of the square of Y, \(E(Y^2)\), is called the second moment of Y. In general, the 
expected value of \(Y^r\) is called the rth moment of the random variable Y. That is, the rth 
moment of Y is \(E(Y^r)\). The skewness is a function of the first, second, and third 
moments of Y, and the kurtosis is a function of the first through fourth moments of Y. 


Standardized Random Variables 


A random variable can be transformed into a random variable with mean 0 and vari- 
ance 1 by subtracting its mean and then dividing by its standard deviation, a process 
called standardization. Specifically, let Y have mean \(\mu_Y\) and variance \(\sigma_Y^2\). 
Then the standardized random variable computed from Y is \((Y - \mu_Y)/\sigma_Y\). The mean 
of the standardized random variable is \(E[(Y - \mu_Y)/\sigma_Y] = [E(Y) - \mu_Y]/\sigma_Y = 0\), 
and its variance is \(\operatorname{var}[(Y - \mu_Y)/\sigma_Y] = \operatorname{var}(Y)/\sigma_Y^2 = 1\). 
Standardized random variables do not have any units, such as dollars or meters, because 
the units of Y are canceled by dividing through by \(\sigma_Y\), which also has the units of Y. 
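
Standardization is easy to check on a discrete distribution. The sketch below standardizes the number of connection failures M from Table 2.1 and confirms that the result has mean 0 and variance 1; the variable names are our own.

```python
import math

# Number of connection failures M, from Table 2.1.
failures = {0: 0.80, 1: 0.10, 2: 0.06, 3: 0.03, 4: 0.01}

mu = sum(m * p for m, p in failures.items())
sigma = math.sqrt(sum((m - mu) ** 2 * p for m, p in failures.items()))

# Subtract the mean and divide by the standard deviation; probabilities are unchanged.
standardized = {(m - mu) / sigma: p for m, p in failures.items()}

mean_z = sum(z * p for z, p in standardized.items())
var_z = sum((z - mean_z) ** 2 * p for z, p in standardized.items())

print(round(mean_z, 10), round(var_z, 10))
```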


2.3 Two Random Variables 


Most of the interesting questions in economics involve two or more variables. Are 
college graduates more likely to have a job than nongraduates? How does the distri- 
bution of income for women compare to that for men? These questions concern the 
distribution of two random variables, considered together (education and employ- 
ment status in the first example, income and sex in the second). Answering such 
questions requires an understanding of the concepts of joint, marginal, and condi- 
tional probability distributions. 


Joint and Marginal Distributions 


Joint distribution. The joint probability distribution of two discrete random variables, 
say X and Y, is the probability that the random variables simultaneously take on cer- 
tain values, say x and y. The probabilities of all possible (x, y) combinations sum to 1. 
The joint probability distribution can be written as the function Pr(X = x, Y = y). 

For example, weather conditions—whether or not it is raining —affect the com- 
muting time of the student commuter in Section 2.1. Let Y be a binary random vari- 
able that equals 1 if the commute is short (less than 20 minutes) and that equals 0 
otherwise, and let X be a binary random variable that equals 0 if it is raining and 1 if 
not. Between these two random variables, there are four possible outcomes: it rains 
and the commute is long (X = 0, Y = 0); rain and short commute (X = 0, Y = 1); 
no rain and long commute (X = 1, Y = 0); and no rain and short commute 
(X = 1, Y = 1). The joint probability distribution is the frequency with which each 
of these four outcomes occurs over many repeated commutes. 

An example of a joint distribution of these two variables is given in Table 2.2. 
According to this distribution, over many commutes, 15% of the days have rain and 
a long commute (X = 0, Y = 0); that is, the probability of a long rainy commute is 
15%, or Pr(X = 0, Y = 0) = 0.15. Also, Pr(X = 0, Y = 1) = 0.15, Pr(X = 1, 
Y = 0) = 0.07, and Pr(X = 1, Y = 1) = 0.63. These four possible outcomes 
are mutually exclusive and constitute the sample space, so the four probabilities 
sum to 1. 

TABLE 2.2 Joint Distribution of Weather Conditions and Commuting Times 

                         Rain (X = 0)   No Rain (X = 1)   Total 
Long commute (Y = 0)     0.15           0.07              0.22 
Short commute (Y = 1)    0.15           0.63              0.78 
Total                    0.30           0.70              1.00 


Marginal probability distribution. The marginal probability distribution of a ran- 
dom variable Y is just another name for its probability distribution. This term is used 
to distinguish the distribution of Y alone (the marginal distribution) from the joint 
distribution of Y and another random variable. 

The marginal distribution of Y can be computed from the joint distribution of X 
and Y by adding up the probabilities of all possible outcomes for which Y takes 
on a specified value. If X can take on l different values \(x_1, \ldots, x_l\), then the 
marginal probability that Y takes on the value y is 


\[ \Pr(Y = y) = \sum_{i=1}^{l} \Pr(X = x_i, Y = y). \tag{2.16} \] 


For example, in Table 2.2, the probability of a long rainy commute is 15%, and the 
probability of a long commute with no rain is 7%, so the probability of a long commute 
(rainy or not) is 22%. The marginal distribution of commuting times is given in 
the final column of Table 2.2. Similarly, the marginal probability that it will rain is 
30%, as shown in the final row of Table 2.2. 
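
Equation (2.16) amounts to summing the joint probabilities over the other variable. A minimal Python sketch using the joint distribution of Table 2.2 follows; the dictionary layout and the function name `marginal` are illustrative choices of ours.

```python
# Joint distribution from Table 2.2, keyed by (x, y):
#   x = 0 if it is raining, 1 if not; y = 0 for a long commute, 1 for a short one.
joint = {
    (0, 0): 0.15, (1, 0): 0.07,
    (0, 1): 0.15, (1, 1): 0.63,
}


def marginal(joint_dist, axis):
    """Marginal distribution of one variable: sum the joint over the other variable.

    axis = 0 returns the distribution of X, axis = 1 the distribution of Y.
    """
    out = {}
    for outcome, p in joint_dist.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out


marginal_x = marginal(joint, 0)  # rain vs. no rain
marginal_y = marginal(joint, 1)  # long vs. short commute

print({k: round(v, 10) for k, v in marginal_x.items()})
print({k: round(v, 10) for k, v in marginal_y.items()})
```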


Conditional Distributions 


Conditional distribution. The distribution of a random variable Y conditional on 
another random variable X taking on a specific value is called the conditional 
distribution of Y given X. The conditional probability that Y takes on the value y 
when X takes on the value x is written Pr(Y = y | X = x). 

For example, what is the probability of a long commute (Y = 0) if you know it 
is raining (X = 0)? From Table 2.2, the joint probability of a rainy short commute 
is 15%, and the joint probability of a rainy long commute is 15%, so if it is raining, 
a long commute and a short commute are equally likely. Thus the probability of a 
long commute (Y = 0) conditional on it being rainy (X = 0) is 50%, or 
Pr(Y = 0 | X = 0) = 0.50. Equivalently, the marginal probability of rain is 30%; 
that is, over many commutes, it rains 30% of the time. Of this 30% of commutes, 
50% of the time the commute is long (0.15/0.30). 


TABLE 2.3 Joint and Conditional Distributions of Number of Wireless Connection 
Failures (M) and Network Age (A) 

A. Joint Distribution 

                       M = 0   M = 1   M = 2   M = 3   M = 4   Total 
Old network (A = 0)    0.35    0.065   0.05    0.025   0.01    0.50 
New network (A = 1)    0.45    0.035   0.01    0.005   0.00    0.50 
Total                  0.80    0.10    0.06    0.03    0.01    1.00 

B. Conditional Distributions of M given A 

                       M = 0   M = 1   M = 2   M = 3   M = 4   Total 
Pr(M | A = 0)          0.70    0.13    0.10    0.05    0.02    1.00 
Pr(M | A = 1)          0.90    0.07    0.02    0.01    0.00    1.00 

In general, the conditional distribution of Y given X = x is 

\[ \Pr(Y = y \mid X = x) = \frac{\Pr(X = x, Y = y)}{\Pr(X = x)}. \tag{2.17} \] 


For example, the conditional probability of a long commute given that it is rainy 
is Pr(Y = 0 | X = 0) = Pr(X = 0, Y = 0)/Pr(X = 0) = 0.15/0.30 = 0.50. 

As a second example, consider a modification of the network connection failure 
example. Suppose that half the time you write your term paper in the school library, 
which has a new wireless network; otherwise, you write it in your room, which has an 
old wireless network. If we treat the location where you write the term paper as 
random, then the network age A (= 1 if the network is new, = 0 if it is old) is a 
random variable. Suppose the joint distribution of the random variables M and A is 
given in Part A of Table 2.3. Then the conditional distributions of connection failures 
given the age of the network are shown in Part B of the table. For example, the joint 
probability of M = 0 and A = 0 is 0.35; because half the time you use the old network, 
the conditional probability of no failures given that you use the old network is 
Pr(M = 0 | A = 0) = Pr(M = 0, A = 0)/Pr(A = 0) = 0.35/0.50 = 0.70, or 70%. 
In contrast, the conditional probability of no failures given that you use the new 
network is 90%. According to the conditional distributions in Part B of Table 2.3, the 
new network is less likely to fail than the old one; for example, the probability of 
three failures is 5% using the old network but 1% using the new network. 
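
Part B of Table 2.3 can be reproduced from Part A by applying Equation (2.17) row by row. The following Python sketch does this; the dictionary layout and the function name `conditional_m_given_a` are our own illustrative choices.

```python
# Joint distribution of network age A and failures M (Part A of Table 2.3),
# keyed by (a, m): a = 0 for the old network, a = 1 for the new one.
joint = {
    (0, 0): 0.35, (0, 1): 0.065, (0, 2): 0.05, (0, 3): 0.025, (0, 4): 0.01,
    (1, 0): 0.45, (1, 1): 0.035, (1, 2): 0.01, (1, 3): 0.005, (1, 4): 0.00,
}


def conditional_m_given_a(a):
    """Pr(M = m | A = a) = Pr(A = a, M = m) / Pr(A = a) for every m."""
    pr_a = sum(p for (ai, _), p in joint.items() if ai == a)  # marginal Pr(A = a)
    return {m: p / pr_a for (ai, m), p in joint.items() if ai == a}


old = conditional_m_given_a(0)
new = conditional_m_given_a(1)

print({m: round(p, 2) for m, p in old.items()})
print({m: round(p, 2) for m, p in new.items()})
```

Each conditional distribution sums to 1 because the joint probabilities in a row are divided by that row's total.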


Conditional expectation. The conditional expectation of Y given X, also called the 
conditional mean of Y given X, is the mean of the conditional distribution of Y 
given X. That is, the conditional expectation is the expected value of Y, computed 
using the conditional distribution of Y given X. If Y takes on k values 
\(y_1, \ldots, y_k\), then the conditional mean of Y given X = x is 


\[ E(Y \mid X = x) = \sum_{i=1}^{k} y_i \Pr(Y = y_i \mid X = x). \tag{2.18} \] 
For example, based on the conditional distributions in Table 2.3, the expected 
number of connection failures, given that the network is old, is E(M | A = 0) = 
0 × 0.70 + 1 × 0.13 + 2 × 0.10 + 3 × 0.05 + 4 × 0.02 = 0.56. The expected number 
of failures, given that the network is new, is E(M | A = 1) = 0.14, less than for the 
old network. 

The conditional expectation of Y given X = x is just the mean value of Y when 
X = x. In the example of Table 2.3, the mean number of failures is 0.56 for the old 
network, so the conditional expectation of M given that the network is old is 0.56. 
Similarly, for the new network, the mean number of failures is 0.14; that is, the 
conditional expectation of M given that the network is new is 0.14. 
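
The two conditional means can be computed from Part B of Table 2.3 with Equation (2.18). The sketch below is illustrative only; `conditional` and `conditional_mean` are names we introduce here.

```python
# Conditional distributions of M given A (Part B of Table 2.3).
# Outer key: network age a (0 = old, 1 = new); inner dict: {m: Pr(M = m | A = a)}.
conditional = {
    0: {0: 0.70, 1: 0.13, 2: 0.10, 3: 0.05, 4: 0.02},
    1: {0: 0.90, 1: 0.07, 2: 0.02, 3: 0.01, 4: 0.00},
}


def conditional_mean(a):
    """E(M | A = a): mean of the conditional distribution of M given A = a."""
    return sum(m * p for m, p in conditional[a].items())


print(round(conditional_mean(0), 2))  # old network
print(round(conditional_mean(1), 2))  # new network
```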


The law of iterated expectations. The mean of Y is the weighted average of the 
conditional expectation of Y given X, weighted by the probability distribution of X. 
For example, the mean height of adults is the weighted average of the mean height 
of men and the mean height of women, weighted by the proportions of men and 
women. Stated mathematically, if X takes on the l values \(x_1, \ldots, x_l\), then 


\[ E(Y) = \sum_{i=1}^{l} E(Y \mid X = x_i) \Pr(X = x_i). \tag{2.19} \] 


Equation (2.19) follows from Equations (2.18) and (2.17) (see Exercise 2.19). 
Stated differently, the expectation of Y is the expectation of the conditional 
expectation of Y given X, 


\[ E(Y) = E[E(Y \mid X)], \tag{2.20} \] 


where the inner expectation on the right-hand side of Equation (2.20) is computed 
using the conditional distribution of Y given X and the outer expectation is com- 
puted using the marginal distribution of X. Equation (2.20) is known as the law of 
iterated expectations. 

For example, the mean number of connection failures M is the weighted average 
of the conditional expectation of M given that the network is old and the conditional 
expectation of M given that the network is new, so E(M) = E(M | A = 0) × Pr(A = 0) + 
E(M | A = 1) × Pr(A = 1) = 0.56 × 0.50 + 0.14 × 0.50 = 0.35. This is the mean 
of the marginal distribution of M, as calculated in Equation (2.2). 
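
The law of iterated expectations can be checked numerically with the numbers in Table 2.3. The sketch below computes E(M) two ways: as the probability-weighted average of the conditional means, and directly from the marginal distribution of M. All variable names are our own.

```python
# Marginal distribution of network age A (0 = old, 1 = new), from Table 2.3.
pr_a = {0: 0.50, 1: 0.50}

# Conditional distributions of M given A (Part B of Table 2.3).
conditional = {
    0: {0: 0.70, 1: 0.13, 2: 0.10, 3: 0.05, 4: 0.02},
    1: {0: 0.90, 1: 0.07, 2: 0.02, 3: 0.01, 4: 0.00},
}

# Inner expectation: E(M | A = a) for each a.
cond_means = {a: sum(m * p for m, p in dist.items()) for a, dist in conditional.items()}

# Outer expectation: average the conditional means over the distribution of A.
mean_iterated = sum(cond_means[a] * pr_a[a] for a in pr_a)

# Direct route: build the marginal distribution of M, then take its mean.
marginal_m = {m: sum(conditional[a][m] * pr_a[a] for a in pr_a) for m in range(5)}
mean_direct = sum(m * p for m, p in marginal_m.items())

print(round(mean_iterated, 10), round(mean_direct, 10))
```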

The law of iterated expectations implies that if the conditional mean of Y given 
X is 0, then the mean of Y is 0. This is an immediate consequence of Equation (2.20): 
if E(Y|X) = 0, then E(Y) = E[E(Y|X)] = E[0] = 0. Said differently, if the 
mean of Y given X is 0, then it must be that the probability-weighted average of these 
conditional means is 0; that is, the mean of Y must be 0. 

The law of iterated expectations also applies to expectations that are conditional 
on multiple random variables. For example, let X, Y, and Z be random variables 
that are jointly distributed. Then the law of iterated expectations says that 
E(Y) = E[E(Y|X, Z)], where E(Y|X, Z) is the conditional expectation of Y
given both X and Z. For example, in the network connection illustration of Table 2.3,
let P denote the number of people using the network; then E(M|A, P) is the 
expected number of failures for a network with age A that has P users. The expected 
number of failures overall, E(M), is the weighted average of the expected number 
of failures for a network with age A and number of users P, weighted by the propor- 
tion of occurrences of both A and P. 

Exercise 2.20 provides some additional properties of conditional expectations 
with multiple variables. 


Conditional variance. The variance of Y conditional on X is the variance of 
the conditional distribution of Y given X. Stated mathematically, the conditional 
variance of Y given X is 


var(Y | X = x) = \sum_{i=1}^{k} [y_i - E(Y | X = x)]^2 Pr(Y = y_i | X = x). (2.21)


For example, the conditional variance of the number of failures given that the
network is old is var(M|A = 0) = (0 - 0.56)^2 × 0.70 + (1 - 0.56)^2 × 0.13 +
(2 - 0.56)^2 × 0.10 + (3 - 0.56)^2 × 0.05 + (4 - 0.56)^2 × 0.02 = 0.99. The
standard deviation of the conditional distribution of M given that A = 0 is thus
\sqrt{0.99} ≈ 0.99. The conditional variance of M given that A = 1 is the variance of the
distribution in the second row of Part B of Table 2.3, which is 0.22, so the standard
deviation of M for the new network is \sqrt{0.22} = 0.47. For the conditional
distributions in Table 2.3, the expected number of failures for the new network (0.14) is less
than that for the old network (0.56), and the spread of the distribution of the number 
of failures, as measured by the conditional standard deviation, is smaller for the new 
network (0.47) than for the old (0.99). 
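
Equation (2.21) can be checked against the old-network figures above. This is a minimal sketch: the probabilities are the ones that appear in the worked calculation in the text, and the helper names cond_mean and cond_var are illustrative.

```python
from math import sqrt

# Conditional distribution of M given A = 0 (old network), read off the
# probabilities used in the worked calculation in the text.
values = [0, 1, 2, 3, 4]
probs_old = [0.70, 0.13, 0.10, 0.05, 0.02]

def cond_mean(values, probs):
    """E(Y | X = x) for a discrete conditional distribution."""
    return sum(v * p for v, p in zip(values, probs))

def cond_var(values, probs):
    """var(Y | X = x): probability-weighted squared deviations from the conditional mean."""
    m = cond_mean(values, probs)
    return sum((v - m) ** 2 * p for v, p in zip(values, probs))

mean_old = cond_mean(values, probs_old)
var_old = cond_var(values, probs_old)
sd_old = sqrt(var_old)
print(round(mean_old, 2), round(var_old, 2), round(sd_old, 2))
```

Both the variance and the standard deviation round to 0.99, which is why the two numbers coincide in the text.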


Bayes’ rule. Bayes’ rule says that the conditional probability of Y given X is the 
conditional probability of X given Y times the relative marginal probabilities of Y 
and X: 


Pr(Y = y | X = x) = \frac{Pr(X = x | Y = y) Pr(Y = y)}{Pr(X = x)} (Bayes’ rule). (2.22)


Equation (2.22) obtains from the definition of the conditional distribution in
Equation (2.17), which implies that Pr(X = x, Y = y) = Pr(Y = y|X = x) Pr(X = x)
and that Pr(X = x, Y = y) = Pr(X = x|Y = y) Pr(Y = y); equating the
right-hand sides of these two equalities and rearranging gives Bayes’ rule.


Bayes’ rule can be used to deduce conditional probabilities from the reverse 
conditional probability, with the help of marginal probabilities. For example, suppose 
you told your friend that you were dropped by the network three times last night 
while working on your term paper and your friend knows that half the time you work 
in the library and half the time you work in your room. Then your friend could 
deduce from Table 2.3 that the probability you worked in your room last night given 
three network failures is 83% (Exercise 2.28). 
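
The 83% figure can be reproduced with a few lines of arithmetic. The sketch below rests on assumptions that this passage does not state: that the room is served by the old network, for which the probability of three failures is the 0.05 used in the conditional variance calculation, and that the library is served by the new network with a probability of 0.01 for three failures. The function name bayes is illustrative.

```python
from fractions import Fraction

def bayes(pr_x_given_y, pr_y, pr_x):
    """Pr(Y = y | X = x) = Pr(X = x | Y = y) * Pr(Y = y) / Pr(X = x)."""
    return pr_x_given_y * pr_y / pr_x

# Assumptions for illustration (not stated in this passage):
#  - room    -> old network, Pr(3 failures | old) = 0.05
#  - library -> new network, Pr(3 failures | new) = 0.01
pr_three_given_room = Fraction(5, 100)
pr_three_given_library = Fraction(1, 100)
pr_room = pr_library = Fraction(1, 2)

# Marginal probability of three failures (law of total probability).
pr_three = pr_three_given_room * pr_room + pr_three_given_library * pr_library

pr_room_given_three = bayes(pr_three_given_room, pr_room, pr_three)
print(pr_room_given_three, round(float(pr_room_given_three), 2))
```

Under these assumed inputs the posterior probability is 5/6, which rounds to the 83% quoted in the text.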


The conditional mean is the minimum mean squared error prediction. The conditional
mean plays a central role in prediction; in fact it is, in a precise sense, the
optimal prediction of Y given X = x.

A common formulation of the statistical prediction problem is to posit that the cost 
of making a prediction error increases with the square of that error. The motivation for 
this squared-error prediction loss is that small errors in prediction might not matter 
much, but large errors can be very costly in real-world applications. Stated mathemati- 
cally, the prediction problem thus is: what is the function g(X) that minimizes the mean 
squared prediction error, E{[Y - g(X)]^2}? The answer is the conditional mean
E(Y|X): Of all possible ways to use the information X, the conditional mean minimizes 
the mean squared prediction error. This result is proven in Appendix 2.2. 
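
A small numerical comparison illustrates the claim without replacing the proof in Appendix 2.2. The joint distribution below is hypothetical, made up only for this sketch, and the two competing predictors are arbitrary choices.

```python
# Hypothetical joint distribution Pr(X = x, Y = y), chosen only for illustration.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.05, (1, 1): 0.15, (1, 2): 0.40,
}

def conditional_mean(joint, x):
    """E(Y | X = x) computed from the joint distribution."""
    pr_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / pr_x

def mspe(joint, g):
    """Mean squared prediction error E{[Y - g(X)]^2} for a predictor g."""
    return sum((y - g(x)) ** 2 * p for (x, y), p in joint.items())

mean_y = sum(y * p for (_, y), p in joint.items())

def cond_mean_predictor(x):
    return conditional_mean(joint, x)

mspe_cond_mean = mspe(joint, cond_mean_predictor)
mspe_uncond_mean = mspe(joint, lambda x: mean_y)                      # ignores X
mspe_shifted = mspe(joint, lambda x: cond_mean_predictor(x) + 0.25)   # perturbed predictor
print(round(mspe_cond_mean, 4), round(mspe_uncond_mean, 4), round(mspe_shifted, 4))
```

Any other function of X substituted for the two competitors should give a mean squared prediction error at least as large as that of the conditional mean.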


Independence 


Two random variables X and Y are independently distributed, or independent, if 
knowing the value of one of the variables provides no information about the other. 
Specifically, X and Y are independent if the conditional distribution of Y given X 
equals the marginal distribution of Y. That is, X and Y are independently distributed 
if, for all values of x and y, 


Pr(Y = y|X = x) = Pr(Y = y) (independence of X and Y). (2.23) 


Substituting Equation (2.23) into Equation (2.17) gives an alternative expression for 
independent random variables in terms of their joint distribution. If X and Y are 
independent, then 


Pr(X = x, Y = y) = Pr(X = x)Pr(Y = y). (2.24) 


That is, the joint distribution of two independent random variables is the product of 
their marginal distributions. 
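
Equation (2.24) gives a direct way to test independence for a discrete joint distribution: compare every joint probability with the product of the corresponding marginals. The two joint distributions in this sketch are hypothetical, and the function names are illustrative.

```python
from itertools import product
from math import isclose

def marginals(joint):
    """Marginal distributions of X and Y from a joint distribution."""
    pr_x, pr_y = {}, {}
    for (x, y), p in joint.items():
        pr_x[x] = pr_x.get(x, 0.0) + p
        pr_y[y] = pr_y.get(y, 0.0) + p
    return pr_x, pr_y

def is_independent(joint, tol=1e-12):
    """True if Pr(X = x, Y = y) = Pr(X = x) Pr(Y = y) for every pair (x, y)."""
    pr_x, pr_y = marginals(joint)
    return all(
        isclose(joint.get((x, y), 0.0), pr_x[x] * pr_y[y], abs_tol=tol)
        for x, y in product(pr_x, pr_y)
    )

# Two hypothetical joint distributions, made up for illustration.
indep = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}
dep = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.50}

print(is_independent(indep), is_independent(dep))
```

The first table is the product of marginals (0.4, 0.6) and (0.3, 0.7), so it passes the check; the second concentrates probability on matching values of X and Y, so it fails.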


Covariance and Correlation 

Covariance. One measure of the extent to which two random variables move 
together is their covariance. The covariance between X and Y is the expected value 
E[(X - \mu_X)(Y - \mu_Y)], where \mu_X is the mean of X and \mu_Y is the mean of Y. The
covariance is denoted cov(X, Y) or \sigma_{XY}. If X can take on l values and Y can take on
k values, then the covariance is given by the formula


cov(X, Y) = \sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]
          = \sum_{i=1}^{k} \sum_{j=1}^{l} (x_j - \mu_X)(y_i - \mu_Y) Pr(X = x_j, Y = y_i). (2.25)


To interpret this formula, suppose that when X is greater than its mean (so that
X - \mu_X is positive), then Y tends to be greater than its mean (so that Y - \mu_Y is
positive) and that when X is less than its mean (so that X - \mu_X < 0), then Y tends
to be less than its mean (so that Y - \mu_Y < 0). In both cases, the product
(X - \mu_X) × (Y - \mu_Y) tends to be positive, so the covariance is positive. In contrast,
if X and Y tend to move in opposite directions (so that X is large when Y is small,
and vice versa), then the covariance is negative. Finally, if X and Y are independent,
then the covariance is 0 (see Exercise 2.19).


Correlation. Because the covariance is the product of X and Y, deviated from their 
means, its units are, awkwardly, the units of X multiplied by the units of Y. This 
“units” problem can make numerical values of the covariance difficult to interpret. 
The correlation is an alternative measure of dependence between X and Y that 
solves the “units” problem of the covariance. Specifically, the correlation between X 
and Y is the covariance between X and Y divided by their standard deviations: 


corr(X, Y) = \frac{cov(X, Y)}{\sqrt{var(X) var(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}. (2.26)


Because the units of the numerator in Equation (2.26) are the same as those of the 
denominator, the units cancel, and the correlation is unit free. The random variables 
X and Y are said to be uncorrelated if corr(X, Y) = 0. 

The correlation always is between -1 and 1; that is, as proven in Appendix 2.1,


-1 ≤ corr(X, Y) ≤ 1 (correlation inequality). (2.27)
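
The covariance and correlation formulas in Equations (2.25) and (2.26) translate directly into code. The joint distribution here is hypothetical and the helper expect is an illustrative name.

```python
from math import sqrt

# Hypothetical joint distribution Pr(X = x, Y = y), made up for illustration.
joint = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.50}

def expect(joint, f):
    """E[f(X, Y)] under a discrete joint distribution."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

mu_x = expect(joint, lambda x, y: x)
mu_y = expect(joint, lambda x, y: y)
cov_xy = expect(joint, lambda x, y: (x - mu_x) * (y - mu_y))   # Equation (2.25)
var_x = expect(joint, lambda x, y: (x - mu_x) ** 2)
var_y = expect(joint, lambda x, y: (y - mu_y) ** 2)
corr_xy = cov_xy / sqrt(var_x * var_y)                          # Equation (2.26)

print(round(cov_xy, 4), round(corr_xy, 4))
```

Rescaling X (for example, multiplying every x value by 100) changes the covariance but leaves the correlation unchanged, which is the sense in which the correlation is unit free.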


Correlation and conditional mean. If the conditional mean of Y does not depend 
on X, then Y and X are uncorrelated. That is, 


if E(Y|X) = \mu_Y, then cov(Y, X) = 0 and corr(Y, X) = 0. (2.28)


We now show this result. First, suppose Y and X have mean 0, so that
cov(Y, X) = E[(Y - \mu_Y)(X - \mu_X)] = E(YX). By the law of iterated expectations
[Equation (2.20)], E(YX) = E[E(YX|X)] = E[E(Y|X)X] = 0 because
E(Y|X) = 0, so cov(Y, X) = 0. Equation (2.28) follows by substituting
cov(Y, X) = 0 into the definition of correlation in Equation (2.26). If Y and X do
not have mean 0, subtract off their means, and then the preceding proof applies.

It is not necessarily true, however, that if X and Y are uncorrelated, then the 
conditional mean of Y given X does not depend on X. Said differently, it is possible 
for the conditional mean of Y to be a function of X but for Y and X nonetheless to 
be uncorrelated. An example is given in Exercise 2.23. 
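
A simple discrete case makes the point concrete. The example below is hypothetical and is not the one in Exercise 2.23: X takes the values -1, 0, and 1 with equal probability, and Y = X^2.

```python
from fractions import Fraction

# Hypothetical example: X is -1, 0, or 1 with equal probability and Y = X**2.
support = [-1, 0, 1]
pr = Fraction(1, 3)

mu_x = sum(x * pr for x in support)
mu_y = sum(x ** 2 * pr for x in support)
cov_xy = sum((x - mu_x) * (x ** 2 - mu_y) * pr for x in support)

# Because Y is a deterministic function of X, E(Y | X = x) is just x**2.
cond_mean_y = {x: x ** 2 for x in support}

print(cov_xy)        # exact covariance
print(cond_mean_y)   # the conditional mean varies with x
```

The covariance is exactly 0, yet the conditional mean of Y is 1 when X is -1 or 1 and 0 when X is 0, so Y clearly depends on X.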


The Mean and Variance of Sums of Random Variables 


The mean of the sum of two random variables, X and Y, is the sum of their means: 


E(X + Y) = E(X) + E(Y) = \mu_X + \mu_Y. (2.29)


The Distribution of Adulthood Earnings in the United Kingdom by Childhood
Socioeconomic Circumstances


Politicians sometimes talk about how inequality in income arises as a result of
differences in individual ability and effort. Are these politicians right? Or, in
contrast, do childhood circumstances affect an individual’s income during adulthood?
For example, do children who grow up with fewer advantages go on to be part of
households with lower average income?

One way to answer these questions is by considering how an individual’s household
income as an adult varies according to their father’s occupational type. While no two
occupations are identical, researchers often group similar jobs into a given number
of meaningful classes. One method of doing this, as seen in the United Kingdom’s
National Statistics Socio-economic Classification (NS-SEC),¹ is grouping jobs into a
hierarchy of three classes: higher, intermediate, and routine.

Figure 2.4 illustrates these three conditional distributions of household income for
individuals in the United Kingdom in 2009 and 2010 according to the NS-SEC of their
father’s occupation in that individual’s childhood.² The lower the classification of
paternal occupation, the more concentrated in the lower end of the distribution is
household income in adulthood.

The statistics for monthly household income for these individuals by NS-SEC
classification are summarized in Table 2.4. For example, the mean income of
individuals whose father’s occupation is classified as routine, that is,
E(Income | Father’s social class = routine), was £2,440.94. This is over £700 less
than that for individuals whose father’s occupation is classified as higher, that is,
E(Income | Father’s social class = higher), which is £3,149.27. Furthermore, these
differences are much greater at higher ends of the distribution, with the difference
in income between these groups being over £900 at the 75th percentile and almost
£1,300 at the 90th percentile. The standard deviation of household income also
increases with occupation classification, meaning that the spread of household income
is also greater according to this measure.

This information is critical when examining the sort of claim discussed earlier. It
appears that childhood circumstances may play some part in determining an
individual’s socioeconomic circumstances later in life. Can we say this for certain?
Is there anything more to consider? These circumstances and others like a “gender
gap” in earnings are an important aspect of the distribution of income. We revisit
this topic in later chapters.


FIGURE 2.4 Conditional Distributions of Household Income of U.K. Individuals in
2009-2010, by Occupational Type of Father

The three distributions of household incomes are for individuals in the United
Kingdom, based on the National Statistics Socio-economic Classification (NS-SEC) of
their father: higher, intermediate, and routine jobs. (Vertical axis: density.)


TABLE 2.4 Summaries of the Conditional Distribution of Monthly Household Income for
Individuals in the United Kingdom Given NS-SEC of Father’s Occupation

                                              Percentile
NS-SEC of                      Standard                  50%
Father’s Job       Mean        Deviation    25%          (median)     75%          90%
(a) Higher         £3,149.27   £2,434.33    £1,663.33    £2,626.92    £3,973.74    £5,629.00
(b) Intermediate    2,692.01    2,187.53     1,362.44     2,237.56     3,382.00     4,881.99
(c) Routine         2,440.94    1,878.58     1,291.00     2,049.74     3,067.76     4,339.84


¹For further details refer to “The National Statistics Socio-economic classification
(NS-SEC),” The Office for National Statistics, https://www.ons.gov.uk/, 2010.

²Conditional distributions were estimated from data from the first wave of the United
Kingdom’s Understanding Society dataset (gathered during 2009 and 2010). More details
are available at https://www.understandingsociety.ac.uk/. Individuals with missing
observations are excluded.


KEY CONCEPT 2.3: Means, Variances, and Covariances of Sums of Random Variables

Let X, Y, and V be random variables; let \mu_X and \sigma_X^2 be the mean and variance
of X and let \sigma_{XY} be the covariance between X and Y (and so forth for the other
variables); and let a, b, and c be constants. Equations (2.30) through (2.36) follow
from the definitions of the mean, variance, and covariance:

E(a + bX + cY) = a + b\mu_X + c\mu_Y, (2.30)
var(a + bY) = b^2 \sigma_Y^2, (2.31)
var(aX + bY) = a^2 \sigma_X^2 + 2ab\sigma_{XY} + b^2 \sigma_Y^2, (2.32)
E(Y^2) = \sigma_Y^2 + \mu_Y^2, (2.33)
cov(a + bX + cV, Y) = b\sigma_{XY} + c\sigma_{VY}, (2.34)
E(XY) = \sigma_{XY} + \mu_X \mu_Y, (2.35)
|corr(X, Y)| ≤ 1 and |\sigma_{XY}| ≤ \sqrt{\sigma_X^2 \sigma_Y^2} (correlation inequality). (2.36)


The variance of the sum of X and Y is the sum of their variances plus two times 
their covariance: 


var(X + Y) = var(X) + var(Y) + 2cov(X, Y) = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY}. (2.37)


If X and Y are independent, then the covariance is 0, and the variance of their sum 
is the sum of their variances: 


var(X + Y) = var(X) + var(Y) = \sigma_X^2 + \sigma_Y^2
(if X and Y are independent). (2.38)


Useful expressions for means, variances, and covariances involving weighted sums of
random variables are collected in Key Concept 2.3. The results in Key Concept 2.3
are derived in Appendix 2.1.
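
The identities for the variance of a sum can be verified exactly on a small example. The sketch below uses a hypothetical joint distribution and exact rational arithmetic; the constants a and b are arbitrary.

```python
from fractions import Fraction

# Hypothetical joint distribution Pr(X = x, Y = y), made up for illustration.
joint = {
    (0, 0): Fraction(3, 10), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(5, 10),
}

def expect(f):
    """E[f(X, Y)] under the joint distribution."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

def var(f):
    """Variance of f(X, Y) under the joint distribution."""
    m = expect(f)
    return expect(lambda x, y: (f(x, y) - m) ** 2)

mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x, var_y = var(lambda x, y: x), var(lambda x, y: y)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))

# Left- and right-hand sides of the variance-of-a-sum identity, Equation (2.37).
lhs_sum = var(lambda x, y: x + y)
rhs_sum = var_x + var_y + 2 * cov_xy

# Left- and right-hand sides of Equation (2.32) with arbitrary constants a and b.
a, b = 2, -3
lhs_weighted = var(lambda x, y: a * x + b * y)
rhs_weighted = a ** 2 * var_x + 2 * a * b * cov_xy + b ** 2 * var_y

print(lhs_sum, rhs_sum, lhs_weighted, rhs_weighted)
```

Because X and Y are positively correlated in this example, the variance of X + Y exceeds the sum of the two variances.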


2.4 The Normal, Chi-Squared, Student t, and F Distributions


The probability distributions most often encountered in econometrics are the normal,
chi-squared, Student t, and F distributions.


The Normal Distribution 


A continuous random variable with a normal distribution has the familiar bell-shaped
probability density shown in Figure 2.5. The function defining the normal
probability density is given in Appendix 18.1. As Figure 2.5 shows, the normal density
with mean \mu and variance \sigma^2 is symmetric around its mean and has 95% of its
probability between \mu - 1.96\sigma and \mu + 1.96\sigma.

Some special notation and terminology have been developed for the normal
distribution. The normal distribution with mean \mu and variance \sigma^2 is expressed
concisely as N(\mu, \sigma^2). The standard normal distribution is the normal distribution with
mean \mu = 0 and variance \sigma^2 = 1 and is denoted N(0, 1). Random variables that
have a N(0, 1) distribution are often denoted Z, and the standard normal cumulative
distribution function is denoted by the Greek letter \Phi; accordingly,
Pr(Z ≤ c) = \Phi(c), where c is a constant. Values of the standard normal cumulative
distribution function are tabulated in Appendix Table 1.

To look up probabilities for a normal variable with a general mean and variance,
we must first standardize the variable. For example, suppose Y is distributed
N(1, 4); that is, Y is normally distributed with a mean of 1 and a variance of 4. What
is the probability that Y ≤ 2; that is, what is the shaded area in Figure 2.6a? The
standardized version of Y is Y minus its mean, divided by its standard deviation; that is,
(Y - 1)/\sqrt{4} = (1/2)(Y - 1). Accordingly, the random variable (1/2)(Y - 1) is normally
distributed with mean 0 and variance 1 (see Exercise 2.8); it has the standard normal


FIGURE 2.5 The Normal Probability Density

The normal probability density function with mean \mu and variance \sigma^2 is a
bell-shaped curve, centered at \mu. The area under the normal p.d.f. between
\mu - 1.96\sigma and \mu + 1.96\sigma is 0.95. The normal distribution is denoted
N(\mu, \sigma^2).


FIGURE 2.6 Calculating the Probability That Y ≤ 2 When Y Is Distributed N(1, 4)

To calculate Pr(Y ≤ 2), standardize Y, then use the standard normal distribution
table. Y is standardized by subtracting its mean (\mu = 1) and dividing by its
standard deviation (\sigma = 2). The probability that Y ≤ 2 is shown in Figure 2.6a, and
the corresponding probability after standardizing Y is shown in Figure 2.6b. Because
the standardized random variable, (Y - 1)/2, is a standard normal (Z) random variable,
Pr(Y ≤ 2) = Pr((Y - 1)/2 ≤ (2 - 1)/2) = Pr(Z ≤ 0.5). From Appendix Table 1,
Pr(Z ≤ 0.5) = \Phi(0.5) = 0.691.

Panel (a): N(1, 4) distribution, with shaded area Pr(Y ≤ 2). Panel (b): N(0, 1)
distribution, with shaded area Pr(Z ≤ 0.5).


KEY CONCEPT 2.4: Computing Probabilities Involving Normal Random Variables

Suppose Y is normally distributed with mean \mu and variance \sigma^2; in other words,
Y is distributed N(\mu, \sigma^2). Then Y is standardized by subtracting its mean and
dividing by its standard deviation, that is, by computing Z = (Y - \mu)/\sigma.
Let c_1 and c_2 denote two numbers with c_1 < c_2, and let d_1 = (c_1 - \mu)/\sigma and
d_2 = (c_2 - \mu)/\sigma. Then

Pr(Y ≤ c_2) = Pr(Z ≤ d_2) = \Phi(d_2), (2.39)
Pr(Y ≥ c_1) = Pr(Z ≥ d_1) = 1 - \Phi(d_1), (2.40)
Pr(c_1 ≤ Y ≤ c_2) = Pr(d_1 ≤ Z ≤ d_2) = \Phi(d_2) - \Phi(d_1). (2.41)

The normal cumulative distribution function \Phi is tabulated in Appendix Table 1.
distribution shown in Figure 2.6b. Now Y ≤ 2 is equivalent to (1/2)(Y - 1) ≤ (1/2)(2 - 1);
that is, (1/2)(Y - 1) ≤ 1/2. Thus

Pr(Y ≤ 2) = Pr[(1/2)(Y - 1) ≤ 1/2] = Pr(Z ≤ 1/2) = \Phi(0.5) = 0.691, (2.42)
where the value 0.691 is taken from Appendix Table 1.
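
The standardization steps in Equations (2.39) through (2.42) can also be carried out in code instead of with a printed table. This sketch writes the standard normal c.d.f. in terms of the error function from Python's math module; the function names are illustrative.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal c.d.f., written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def pr_at_most(c2, mu, sigma):          # Equation (2.39)
    return phi((c2 - mu) / sigma)

def pr_at_least(c1, mu, sigma):         # Equation (2.40)
    return 1.0 - phi((c1 - mu) / sigma)

def pr_between(c1, c2, mu, sigma):      # Equation (2.41)
    return phi((c2 - mu) / sigma) - phi((c1 - mu) / sigma)

# Y ~ N(1, 4): mean 1, variance 4, so the standard deviation is 2.
p = pr_at_most(2.0, mu=1.0, sigma=2.0)
central = pr_between(-1.96, 1.96, mu=0.0, sigma=1.0)
print(round(p, 3), round(central, 3))
```

Note that the functions take the standard deviation, not the variance, as their last argument, which is the most common slip when standardizing by hand.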

The same approach can be used to compute the probability that a normally distrib- 
uted random variable exceeds (or is less than) some value or that it falls in a certain 
range. These steps are discussed in Key Concept 2.4. The box “The Unpegging of the 
Swiss Franc” presents an unusual application of the cumulative normal distribution. 

The normal distribution is symmetric, so its skewness is 0. The kurtosis of the 
normal distribution is 3. 


The multivariate normal distribution. The normal distribution can be generalized 
to describe the joint distribution of a set of random variables. In this case, the distri- 
bution is called the multivariate normal distribution or, if only two variables are 
being considered, the bivariate normal distribution. The formula for the bivariate 
normal p.d.f. is given in Appendix 18.1, and the formula for the general multivariate 
normal p.d.f. is given in Appendix 19.2. 

The multivariate normal distribution has four important properties. If X and Y
have a bivariate normal distribution with covariance \sigma_{XY} and if a and b are two
constants, then aX + bY has the normal distribution:


aX + bY is distributed N(a\mu_X + b\mu_Y, a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\sigma_{XY})
(X, Y bivariate normal). (2.43)


The Unpegging of the Swiss Franc


On Thursday, January 15, 2015, the value of the euro fell by 17.472% from 1.201 to
0.991 against the Swiss franc. This was a huge shift, illustrated in the downward
spike in Figure 2.7, given that the previous year had not seen a day’s movement
greater than 0.544%. If you had woken up as a statistical analyst for a financial
company on that Thursday morning, how might you have estimated the probability of
this happening that day?

If you had assumed the data was normally distributed, you would have required an
estimate of the standard deviation of daily percentage change in the euro/Swiss franc
exchange rates. Using Datastream data¹ for the year to January 14, 2015, you can
estimate this as 0.112%.

What was the probability of a drop of 17.472%? We can first calculate the number of
standard deviations that describes a change of this magnitude as 17.472/0.112 = 156.
If the daily percentage changes are normally distributed, then the estimate of the
probability of a fall at least as big as 156 standard deviations corresponds to an
inconceivably small number, 8.175 × 10^{-5288}, which is derived using Equation (2.39).
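
A probability this small is below the smallest number that ordinary double-precision arithmetic can represent, so it has to be computed on a logarithmic scale. The sketch below uses a standard large-argument approximation to the normal tail rather than Equation (2.39) evaluated directly; the function name log10_lower_tail is illustrative, and the approximation is only appropriate when the number of standard deviations is large.

```python
from math import erfc, floor, log, pi, sqrt

def log10_lower_tail(k):
    """log10 of Pr(Z <= -k) for a standard normal Z and large k > 0.

    Uses the asymptotic expansion
        Pr(Z <= -k) ~ phi(k) / k * (1 - 1/k**2 + 3/k**4),
    where phi is the standard normal density. Accurate only for large k.
    """
    ln_tail = (-0.5 * k * k - log(k) - 0.5 * log(2.0 * pi)
               + log(1.0 - 1.0 / k ** 2 + 3.0 / k ** 4))
    return ln_tail / log(10.0)

k = 17.472 / 0.112            # size of the drop in standard deviations
lg = log10_lower_tail(k)
exponent = floor(lg)          # power of 10
mantissa = 10 ** (lg - exponent)
print(round(k), round(mantissa, 2), exponent)

# For a moderate number of standard deviations the tail probability is
# representable directly, for example 9.80 standard deviations.
direct = 0.5 * erfc(9.80 / sqrt(2.0))
print(direct)
```

The same log-scale trick is useful whenever a tail probability underflows to zero in floating-point arithmetic.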


¹Datastream, maintained by Thomson Reuters, is a global financial and macroeconomic
data platform that acts as a repository of financial and economic data.


FIGURE 2.7 Daily Percentage Change in the Euro/Swiss Franc Exchange Rate

The day-on-day percentage change in the value of the euro in Swiss francs for a year
before and a year after the unpegging of the Swiss franc on January 15, 2015.
(Vertical axis: percent change; horizontal axis: year, 2014 to 2016.)


So was the probability of a fall at least this large really so small? Well, no. The
error here is to not investigate the nature of our data further, and to fail to
understand the actual process that determined the value of the currency. The Swiss
franc had in fact been kept within very small bounds due to the actions of the
country’s central bank in setting a so-called “peg” for the currency. In the previous
twelve months, this had been within the range of 1.2008 and 1.236 Swiss francs per
euro. In fact, the introduction of this peg over three years earlier had caused an
appreciation of the euro against the Swiss franc of over 20 standard deviations
(again, assuming a normal distribution derived from previous daily changes!).²


It was the introduction of the peg that had caused 
such little volatility in—or such a low standard devia- 
tion of—the value of the currency. Once this peg was 
removed, as happened on that particular Thursday, 
the value of the currency was able to float and vary 
according to market factors. Investors responded to 
the removal of the peg by bidding down the value of 
the euro against the franc substantially. 

It is not only the removal of a currency peg in this way that can cause extreme
fluctuations. The result of the 2016 “Brexit” referendum in the United Kingdom (an
event that, while seen as unlikely, was at least partly foreseeable) led to an
appreciation in the value of the euro against British pound sterling on June 24,
2016, of 6.17%. This is equivalent to 9.80 standard deviations (based on data from
the previous year), or an event with an apparent probability of 5.629 × 10^{-23}.
While this may seem substantially more likely to occur than the Swiss franc move, an
event with this probability would be expected less than once every
1,000,000,000,000,000,000 years (a total of 18 zeros)!³ Again, it seems unlikely
that this is an accurate characterization of the probability of such an event
occurring.

Clearly, it is dangerous to assume that data is normally distributed or that recent
observations of a variable will provide a useful prediction of the range of future
values. Indeed, it is partly for this reason that advertisements for financial
products in the United Kingdom must carry a disclaimer that “past performance is not
a guide to future performance.”


²See the article published in Reuters, “Charts of the Day, Swiss Franc Edition,” by
Felix Salmon, September 6, 2011.

³This is based on the assumption of 260 trading days per year.


More generally, if n random variables have a multivariate normal distribution,
then any linear combination of these variables (such as their sum) is normally
distributed.

Second, if a set of variables has a multivariate normal distribution, then the
marginal distribution of each of the variables is normal [this follows from Equation
(2.43) by setting a = 1 and b = 0].

Third, if variables with a multivariate normal distribution have covariances that equal
0, then the variables are independent. Thus, if X and Y have a bivariate normal distribution
and \sigma_{XY} = 0, then X and Y are independent (this is shown in Appendix 18.1). In
Section 2.3, it was shown that if X and Y are independent, then, regardless of their
joint distribution, \sigma_{XY} = 0. If X and Y are jointly normally distributed, then the
converse is also true. This result, that 0 covariance implies independence, is a special
property of the multivariate normal distribution that is not true in general.

Fourth, if X and Y have a bivariate normal distribution, then the conditional expec- 
tation of Y given X is linear in X; that is, E(Y|X = x) = a + bx, where a and b are 
constants (Exercise 18.11). Joint normality implies linearity of conditional expectations, 
but linearity of conditional expectations does not imply joint normality.


The Chi-Squared Distribution


The chi-squared distribution is used when testing certain types of hypotheses in
statistics and econometrics.

The chi-squared distribution is the distribution of the sum of m squared independent
standard normal random variables. This distribution depends on m, which is
called the degrees of freedom of the chi-squared distribution. For example, let Z_1, Z_2,
and Z_3 be independent standard normal random variables. Then Z_1^2 + Z_2^2 + Z_3^2 has
a chi-squared distribution with 3 degrees of freedom. The name for this distribution
derives from the Greek letter used to denote it: A chi-squared distribution with m
degrees of freedom is denoted \chi_m^2.

Selected percentiles of the \chi_m^2 distribution are given in Appendix Table 3. For
example, Appendix Table 3 shows that the 95th percentile of the \chi_3^2 distribution is
7.81, so Pr(Z_1^2 + Z_2^2 + Z_3^2 ≤ 7.81) = 0.95.
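
The definition of the chi-squared distribution as a sum of squared standard normals can be checked by simulation. The sketch below also uses a closed-form c.d.f. that applies to 3 degrees of freedom only; the function name, the fixed seed, and the number of draws are illustrative choices.

```python
from math import erf, exp, pi, sqrt
import random

def chi2_cdf_3df(x):
    """C.d.f. of the chi-squared distribution with 3 degrees of freedom.

    Closed form that holds for m = 3 only; other m need a different formula.
    """
    return erf(sqrt(x / 2.0)) - sqrt(2.0 * x / pi) * exp(-x / 2.0)

exact = chi2_cdf_3df(7.81)

# Monte Carlo check straight from the definition: sum three squared N(0, 1) draws.
rng = random.Random(0)          # fixed seed so the run is reproducible
draws = 100_000
hits = sum(
    sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3)) <= 7.81
    for _ in range(draws)
)
simulated = hits / draws
print(round(exact, 3), round(simulated, 3))
```

Both numbers should be close to 0.95, with the simulated share differing from the exact value only by sampling noise.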


The Student t Distribution 


The Student t distribution with m degrees of freedom is defined to be the distribution
of the ratio of a standard normal random variable to the square root of an independently
distributed chi-squared random variable with m degrees of freedom divided by m. That
is, let Z be a standard normal random variable, let W be a random variable with a
chi-squared distribution with m degrees of freedom, and let Z and W be independently
distributed. Then the random variable Z/\sqrt{W/m} has a Student t distribution (also
called the t distribution) with m degrees of freedom. This distribution is denoted t_m.
Selected percentiles of the Student t distribution are given in Appendix Table 2.

The Student t distribution depends on the degrees of freedom m. Thus the 95th
percentile of the t_m distribution depends on the degrees of freedom m. The Student
t distribution has a bell shape similar to that of the normal distribution, but it has
more mass in the tails; that is, it is a “fatter” bell shape than the normal. When m is
30 or more, the Student t distribution is well approximated by the standard normal
distribution, and the t_\infty distribution equals the standard normal distribution.
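
The heavier tails of the Student t distribution show up clearly in a simulation built from its definition. In this sketch the chi-squared draw is generated as a gamma random variable with shape m/2 and scale 2, a standard equivalence; the cutoff of 1.96, the seed, and the number of draws are arbitrary illustrative choices.

```python
from math import erf, sqrt
import random

def student_t_draw(rng, m):
    """One draw of Z / sqrt(W / m) with Z ~ N(0, 1) and W ~ chi-squared(m)."""
    z = rng.gauss(0.0, 1.0)
    # A chi-squared(m) variable is a gamma variable with shape m/2 and scale 2.
    w = rng.gammavariate(m / 2.0, 2.0)
    return z / sqrt(w / m)

def tail_share(rng, m, cutoff=1.96, draws=100_000):
    """Simulated Pr(|t| > cutoff) for m degrees of freedom."""
    return sum(abs(student_t_draw(rng, m)) > cutoff for _ in range(draws)) / draws

rng = random.Random(0)          # fixed seed so the run is reproducible
tail_t5 = tail_share(rng, 5)
tail_t30 = tail_share(rng, 30)
tail_normal = 2.0 * (1.0 - 0.5 * (1.0 + erf(1.96 / sqrt(2.0))))
print(round(tail_t5, 3), round(tail_t30, 3), round(tail_normal, 3))
```

The share of draws beyond 1.96 in absolute value should be largest for 5 degrees of freedom, smaller for 30, and closest to the normal value of about 0.05 as m grows.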


The F Distribution 


The F distribution with m and n degrees of freedom, denoted F_{m,n}, is defined to be
the distribution of the ratio of a chi-squared random variable with degrees of freedom
m, divided by m, to an independently distributed chi-squared random variable
with degrees of freedom n, divided by n. To state this mathematically, let W be a
chi-squared random variable with m degrees of freedom and let V be a chi-squared
random variable with n degrees of freedom, where W and V are independently
distributed. Then (W/m)/(V/n) has an F_{m,n} distribution; that is, an F distribution with
numerator degrees of freedom m and denominator degrees of freedom n.

In statistics and econometrics, an important special case of the F distribution
arises when the denominator degrees of freedom is large enough that the F_{m,n}
distribution can be approximated by the F_{m,\infty} distribution. In this limiting case, the
denominator random variable V/n is the mean of infinitely many squared standard
normal random variables, and that mean is 1 because the mean of a squared standard
normal random variable is 1 (see Exercise 2.24). Thus the F_{m,\infty} distribution is the
distribution of a chi-squared random variable with m degrees of freedom divided by
m: W/m is distributed F_{m,\infty}. For example, from Appendix Table 4, the 95th percentile
of the F_{3,\infty} distribution is 2.60, which is the same as the 95th percentile of the \chi_3^2
distribution, 7.81 (from Appendix Table 3), divided by the degrees of freedom, which
is 3 (7.81/3 = 2.60).

The 90th, 95th, and 99th percentiles of the F_{m,n} distribution are given in Appendix
Table 5 for selected values of m and n. For example, the 95th percentile of the F_{3,30}
distribution is 2.92, and the 95th percentile of the F_{3,90} distribution is 2.71. As the
denominator degrees of freedom n increases, the 95th percentile of the F_{3,n} distribution
tends to the F_{3,\infty} limit of 2.60.
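
The convergence of the F percentiles to the chi-squared limit can be seen by simulating the ratio that defines the distribution. This is a rough Monte Carlo sketch: the chi-squared draws are generated as gamma random variables with shape m/2 and scale 2, and the seed, number of draws, and the large denominator value of 3000 are arbitrary illustrative choices.

```python
import random

def f_draw(rng, m, n):
    """One draw of (W/m) / (V/n), with independent W ~ chi-squared(m), V ~ chi-squared(n)."""
    w = rng.gammavariate(m / 2.0, 2.0)   # chi-squared(m) as gamma(shape m/2, scale 2)
    v = rng.gammavariate(n / 2.0, 2.0)
    return (w / m) / (v / n)

def empirical_percentile(samples, q):
    """Crude empirical q-th quantile: the order statistic at position q * N."""
    ordered = sorted(samples)
    return ordered[int(q * len(ordered))]

rng = random.Random(0)          # fixed seed so the run is reproducible
draws = 100_000
q95_n30 = empirical_percentile([f_draw(rng, 3, 30) for _ in range(draws)], 0.95)
q95_n3000 = empirical_percentile([f_draw(rng, 3, 3000) for _ in range(draws)], 0.95)
limit = 7.81 / 3               # 95th percentile of chi-squared(3), divided by m = 3
print(round(q95_n30, 2), round(q95_n3000, 2), round(limit, 2))
```

The simulated 95th percentile for 30 denominator degrees of freedom should land near the tabulated 2.92, and the one for a very large denominator near the limit of 2.60, up to simulation error.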


Random Sampling and the Distribution 
of the Sample Average 


Almost all the statistical and econometric procedures used in this text involve aver- 
ages or weighted averages of a sample of data. Characterizing the distributions of 
sample averages therefore is an essential step toward understanding the performance 
of econometric procedures. 

This section introduces some basic concepts about random sampling and the 
distributions of averages that are used throughout the book. We begin by discussing 
random sampling. The act of random sampling —that is, randomly drawing a sample 
from a larger population—has the effect of making the sample average itself a ran- 
dom variable. Because the sample average is a random variable, it has a probability 
distribution, which is called its sampling distribution. This section concludes with 
some properties of the sampling distribution of the sample average. 


Random Sampling 


Simple random sampling. Suppose our commuting student from Section 2.1 aspires 
to be a statistician and decides to record her commuting times on various days. She 
selects these days at random from the school year, and her daily commuting time has 
the cumulative distribution function in Figure 2.2a. Because these days were selected 
at random, knowing the value of the commuting time on one of these randomly 
selected days provides no information about the commuting time on another of the 
days; that is, because the days were selected at random, the values of the commuting 
time on the different days are independently distributed random variables. 

The situation described in the previous paragraph is an example of the simplest
sampling scheme used in statistics, called simple random sampling, in which n objects are
selected at random from a population (the population of commuting days) and each
member of the population (each day) is equally likely to be included in the sample.


KEY CONCEPT 2.5: Simple Random Sampling and i.i.d. Random Variables

In a simple random sample, n objects are drawn at random from a population, and
each object is equally likely to be drawn. The value of the random variable Y for
the i-th randomly drawn object is denoted Y_i. Because each object is equally likely
to be drawn and the distribution of Y_i is the same for all i, the random variables
Y_1, ..., Y_n are independently and identically distributed (i.i.d.); that is, the
distribution of Y_i is the same for all i = 1, ..., n, and Y_1 is distributed independently
of Y_2, ..., Y_n and so forth.

The n observations in the sample are denoted Y_1, ..., Y_n, where Y_1 is the first
observation, Y_2 is the second observation, and so forth. In the commuting example,
Y_1 is the commuting time on the first of the n randomly selected days, and Y_i is the
commuting time on the i-th of the randomly selected days.

Because the members of the population included in the sample are selected at
random, the values of the observations Y_1, ..., Y_n are themselves random. If different
members of the population are chosen, their values of Y will differ. Thus the act
of random sampling means that Y_1, ..., Y_n can be treated as random variables.
Before they are sampled, Y_1, ..., Y_n can take on many possible values; after they are
sampled, a specific value is recorded for each observation.


i.i.d. draws. Because Y_1, ..., Y_n are randomly drawn from the same population, the
marginal distribution of Y_i is the same for each i = 1, ..., n; this marginal
distribution is the distribution of Y in the population being sampled. When Y_i has the same
marginal distribution for i = 1, ..., n, then Y_1, ..., Y_n are said to be identically
distributed.

Under simple random sampling, knowing the value of Y_1 provides no information
about Y_2, so the conditional distribution of Y_2 given Y_1 is the same as the marginal
distribution of Y_2. In other words, under simple random sampling, Y_1 is
distributed independently of Y_2, ..., Y_n.

When Y_1, ..., Y_n are drawn from the same distribution and are independently
distributed, they are said to be independently and identically distributed (i.i.d.).
Simple random sampling and i.i.d. draws are summarized in Key Concept 2.5. 
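
The independence property described above can be checked informally by simulation. The sketch below is only an illustration under stated assumptions: the Bernoulli population with success probability 0.78, the seed, and the helper name `share_y2_equal_1` are chosen here and are not part of the text. It draws many i.i.d. pairs \((Y_1, Y_2)\) and compares the share of pairs with \(Y_2 = 1\) overall with the share among pairs with \(Y_1 = 1\) and among pairs with \(Y_1 = 0\).

```python
import random

rng = random.Random(1)
p = 0.78  # assumed Pr(Y = 1) in the population (illustrative value)

# Each pair holds two independent draws from the same Bernoulli population.
pairs = [(int(rng.random() < p), int(rng.random() < p)) for _ in range(100000)]

def share_y2_equal_1(pairs, y1=None):
    """Fraction of pairs with Y2 = 1, optionally restricted to pairs with Y1 = y1."""
    kept = [second for first, second in pairs if y1 is None or first == y1]
    return sum(kept) / len(kept)

marginal = share_y2_equal_1(pairs)       # estimates Pr(Y2 = 1)
given_1 = share_y2_equal_1(pairs, y1=1)  # estimates Pr(Y2 = 1 | Y1 = 1)
given_0 = share_y2_equal_1(pairs, y1=0)  # estimates Pr(Y2 = 1 | Y1 = 0)
print(marginal, given_1, given_0)        # all three should be close to p
```

If the draws were not independent, the two conditional shares would differ systematically from the marginal share.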


The Sampling Distribution of the Sample Average 
The sample average or sample mean, \(\bar{Y}\), of the n observations \(Y_1, \dots, Y_n\) is

\[
\bar{Y} = \frac{1}{n}(Y_1 + Y_2 + \cdots + Y_n) = \frac{1}{n}\sum_{i=1}^{n} Y_i. \tag{2.44}
\]

An essential concept is that the act of drawing a random sample has the effect of
making the sample average \(\bar{Y}\) a random variable. Because the sample was drawn at
random, the value of each \(Y_i\) is random. Because \(Y_1, \dots, Y_n\) are random, their average
is random. Had a different sample been drawn, then the observations and their sample
average would have been different: The value of \(\bar{Y}\) differs from one randomly
drawn sample to the next.

For example, suppose our student commuter selected five days at random to 
record her commute times, then computed the average of those five times. Had she 
chosen five different days, she would have recorded five different times—and thus 
would have computed a different value of the sample average. 
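
As a rough illustration of this point, the following Python sketch draws two simple random samples of five days from a made-up population of commuting times and compares their averages. The population values, the seed, and the function name `sample_average` are assumptions for illustration only; they do not come from the student's data.

```python
import random

def sample_average(population, n, rng):
    """Draw n members at random (each equally likely) and return their average."""
    sample = rng.sample(population, n)  # simple random sample, without replacement
    return sum(sample) / n

# Hypothetical population: commuting times (in minutes) on 200 school days.
rng = random.Random(0)
population = [rng.gauss(25, 5) for _ in range(200)]

first_average = sample_average(population, 5, rng)
second_average = sample_average(population, 5, rng)
print(first_average, second_average)  # two different values of the sample average
```

Each call selects a different set of five days, so the two printed averages differ even though the population is unchanged.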

Because \(\bar{Y}\) is random, it has a probability distribution. The distribution of \(\bar{Y}\) is
called the sampling distribution of \(\bar{Y}\) because it is the probability distribution
associated with possible values of \(\bar{Y}\) that could be computed for different possible samples
\(Y_1, \dots, Y_n\).

The sampling distribution of averages and weighted averages plays a central role
in statistics and econometrics. We start our discussion of the sampling distribution of
\(\bar{Y}\) by computing its mean and variance under general conditions on the population
distribution of Y.


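Mean and variance of \(\bar{Y}\). Suppose that the observations \(Y_1, \dots, Y_n\) are i.i.d., and let
\(\mu_Y\) and \(\sigma_Y^2\) denote the mean and variance of \(Y_i\) (because the observations are i.i.d.,
the mean is the same for all \(i = 1, \dots, n\), and so is the variance). When n = 2, the
mean of the sum \(Y_1 + Y_2\) is given by applying Equation (2.29):
\(E(Y_1 + Y_2) = \mu_Y + \mu_Y = 2\mu_Y\). Thus the mean of the sample average is
\(E[\tfrac{1}{2}(Y_1 + Y_2)] = \tfrac{1}{2} \times 2\mu_Y = \mu_Y\). In general,

\[
E(\bar{Y}) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \mu_Y. \tag{2.45}
\]

The variance of \(\bar{Y}\) is found by applying Equation (2.38). For example, for
n = 2, \(\operatorname{var}(Y_1 + Y_2) = 2\sigma_Y^2\), so [by applying Equation (2.32) with \(a = b = \tfrac{1}{2}\) and
\(\operatorname{cov}(Y_1, Y_2) = 0\)], \(\operatorname{var}(\bar{Y}) = \tfrac{1}{2}\sigma_Y^2\). For general n, because \(Y_1, \dots, Y_n\) are i.i.d., \(Y_i\) and
\(Y_j\) are independently distributed for \(i \neq j\), so \(\operatorname{cov}(Y_i, Y_j) = 0\). Thus

\[
\begin{aligned}
\operatorname{var}(\bar{Y}) &= \operatorname{var}\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) \\
&= \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{var}(Y_i) + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1,\, j \neq i}^{n} \operatorname{cov}(Y_i, Y_j) \\
&= \frac{\sigma_Y^2}{n}.
\end{aligned}
\tag{2.46}
\]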


The standard deviation of \(\bar{Y}\) is the square root of the variance, \(\sigma_Y/\sqrt{n}\).

Financial Diversification and Portfolios

The principle of diversification says that you can reduce your risk by holding small
investments in multiple assets, compared to putting all your money into one asset.
That is, you shouldn’t put all your eggs in one basket.

The math of diversification follows from Equation (2.46). Suppose you divide $1
equally among n assets. Let \(Y_i\) represent the payout in one year of $1 invested in
the \(i^{\text{th}}\) asset. Because you invested 1/n dollars in each asset, the actual payoff of
your portfolio after one year is \((Y_1 + Y_2 + \cdots + Y_n)/n = \bar{Y}\). To keep things simple,
suppose that each asset has the same expected payout, \(\mu_Y\), the same variance, \(\sigma^2\),
and the same positive correlation, \(\rho\), across assets [so that
\(\operatorname{cov}(Y_i, Y_j) = \rho\sigma^2\)]. Then the expected payout is \(E(\bar{Y}) = \mu_Y\), and for large n, the
variance of the portfolio payout is \(\operatorname{var}(\bar{Y}) = \rho\sigma^2\) (Exercise 2.26). Putting all your
money into one asset or spreading it equally across all n assets has the same expected
payout, but diversifying reduces the variance from \(\sigma^2\) to \(\rho\sigma^2\).

The math of diversification has led to financial products such as stock mutual funds,
in which the fund holds many stocks and an individual owns a share of the fund,
thereby owning a small amount of many stocks. But diversification has its limits: For
many assets, payouts are positively correlated, so \(\operatorname{var}(\bar{Y})\) remains positive even if
n is large. In the case of stocks, risk is reduced by holding a portfolio, but that
portfolio remains subject to the unpredictable fluctuations of the overall stock market.

In summary, if \(Y_1, \dots, Y_n\) are i.i.d., the mean, the variance, and the standard
deviation of \(\bar{Y}\) are

\[
E(\bar{Y}) = \mu_Y, \tag{2.47}
\]
\[
\operatorname{var}(\bar{Y}) = \sigma_{\bar{Y}}^2 = \frac{\sigma_Y^2}{n}, \text{ and} \tag{2.48}
\]
\[
\operatorname{std.dev}(\bar{Y}) = \sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{n}}. \tag{2.49}
\]


These results hold whatever the distribution of Y is; that is, the distribution of Y does 
not need to take on a specific form, such as the normal distribution, for Equations 
(2.47) through (2.49) to hold. 
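
A small simulation can make this concrete. The sketch below is an illustration only: it assumes an exponential population with mean 1 and variance 1 (a distribution chosen here because it is clearly not normal), a sample size of n = 10, and 20,000 simulated samples; none of these choices comes from the text. It compares the simulated mean and variance of the sample average with the values implied by Equations (2.47) through (2.49).

```python
import random
import statistics

def simulate_sample_means(n, reps, rng):
    """Return `reps` sample averages, each computed from n i.i.d. exponential(1) draws."""
    return [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(reps)
    ]

rng = random.Random(42)
n, reps = 10, 20000
means = simulate_sample_means(n, reps, rng)

mu_Y, var_Y = 1.0, 1.0  # mean and variance of the assumed exponential(1) population
sim_mean = statistics.mean(means)
sim_var = statistics.pvariance(means)

print(sim_mean, mu_Y)                      # simulated vs. theoretical mean of the sample average
print(sim_var, var_Y / n)                  # simulated vs. theoretical variance, sigma_Y^2 / n
print(sim_var ** 0.5, (var_Y / n) ** 0.5)  # simulated vs. theoretical std. dev., sigma_Y / sqrt(n)
```

The simulated values should be close to, but not exactly equal to, the theoretical ones, because the simulation itself uses a finite number of samples.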

The notation \(\sigma_{\bar{Y}}^2\) denotes the variance of the sampling distribution of the sample
average \(\bar{Y}\). In contrast, \(\sigma_Y^2\) is the variance of each individual \(Y_i\), that is, the variance of
the population distribution from which the observation is drawn. Similarly, \(\sigma_{\bar{Y}}\)
denotes the standard deviation of the sampling distribution of \(\bar{Y}\).


Sampling distribution of \(\bar{Y}\) when Y is normally distributed. Suppose that \(Y_1, \dots, Y_n\)
are i.i.d. draws from the \(N(\mu_Y, \sigma_Y^2)\) distribution. As stated following Equation (2.43),
the sum of n normally distributed random variables is itself normally distributed.
Because the mean of \(\bar{Y}\) is \(\mu_Y\) and the variance of \(\bar{Y}\) is \(\sigma_Y^2/n\), this means that, if
\(Y_1, \dots, Y_n\) are i.i.d. draws from the \(N(\mu_Y, \sigma_Y^2)\) distribution, then \(\bar{Y}\) is distributed
\(N(\mu_Y, \sigma_Y^2/n)\).


2.6 Large-Sample Approximations to Sampling Distributions


Sampling distributions play a central role in the development of statistical and
econometric procedures, so it is important to know, in a mathematical sense, what the
sampling distribution of \(\bar{Y}\) is. There are two approaches to characterizing sampling
distributions: an “exact” approach and an “approximate” approach.

The exact approach entails deriving a formula for the sampling distribution that
holds exactly for any value of n. The sampling distribution that exactly describes the
distribution of \(\bar{Y}\) for any n is called the exact distribution or finite-sample distribution
of \(\bar{Y}\). For example, if Y is normally distributed and \(Y_1, \dots, Y_n\) are i.i.d., then (as
discussed in Section 2.5) the exact distribution of \(\bar{Y}\) is normal with mean \(\mu_Y\) and variance
\(\sigma_Y^2/n\). Unfortunately, if the distribution of Y is not normal, then in general the exact
sampling distribution of \(\bar{Y}\) is very complicated and depends on the distribution of Y.

The approximate approach uses approximations to the sampling distribution
that rely on the sample size being large. The large-sample approximation to the
sampling distribution is often called the asymptotic distribution (“asymptotic” because
the approximations become exact in the limit that \(n \to \infty\)). As we see in this section,
these approximations can be very accurate even if the sample size is only n = 30
observations. Because sample sizes used in practice in econometrics typically number
in the hundreds or thousands, these asymptotic distributions can be counted on to
provide very good approximations to the exact sampling distribution.

This section presents the two key tools used to approximate sampling distributions
when the sample size is large: the law of large numbers and the central limit
theorem. The law of large numbers says that when the sample size is large, \(\bar{Y}\) will be
close to \(\mu_Y\) with very high probability. The central limit theorem says that when the
sample size is large, the sampling distribution of the standardized sample average,
\((\bar{Y} - \mu_Y)/\sigma_{\bar{Y}}\), is approximately normal.

Although exact sampling distributions are complicated and depend on the
distribution of Y, the asymptotic distributions are simple. Moreover, and remarkably, the
asymptotic normal distribution of \((\bar{Y} - \mu_Y)/\sigma_{\bar{Y}}\) does not depend on the distribution
of Y. This normal approximate distribution provides enormous simplifications and
underlies the theory of regression used throughout this text.


The Law of Large Numbers and Consistency 


The law of large numbers states that, under general conditions, \(\bar{Y}\) will be near \(\mu_Y\) with very
high probability when n is large. This is sometimes called the “law of averages.” When a large
number of random variables with the same mean are averaged together, the large values
tend to balance the small values, and their sample average is close to their common mean.

Key Concept 2.6: Convergence in Probability, Consistency, and the Law of Large Numbers

The sample average \(\bar{Y}\) converges in probability to \(\mu_Y\) (or, equivalently, \(\bar{Y}\) is
consistent for \(\mu_Y\)) if the probability that \(\bar{Y}\) is in the range \((\mu_Y - c)\) to \((\mu_Y + c)\)
becomes arbitrarily close to 1 as n increases for any constant \(c > 0\). The
convergence of \(\bar{Y}\) to \(\mu_Y\) in probability is written \(\bar{Y} \xrightarrow{p} \mu_Y\).

The law of large numbers says that if \(Y_1, \dots, Y_n\) are independently and
identically distributed with \(E(Y_i) = \mu_Y\) and if large outliers are unlikely (technically if
\(\operatorname{var}(Y_i) = \sigma_Y^2 < \infty\)), then \(\bar{Y} \xrightarrow{p} \mu_Y\).

For example, consider a simplified version of our student commuter’s experiment
in which she simply records whether her commute was short (less than
20 minutes) or long. Let \(Y_i = 1\) if her commute was short on the \(i^{\text{th}}\) randomly selected
day and \(Y_i = 0\) if it was long. Because she used simple random sampling, \(Y_1, \dots, Y_n\)
are i.i.d. Thus \(Y_1, \dots, Y_n\) are i.i.d. draws of a Bernoulli random variable, where (from
Table 2.2) the probability that \(Y_i = 1\) is 0.78. Because the expectation of a Bernoulli
random variable is its success probability, \(E(Y_i) = \mu_Y = 0.78\). The sample average
\(\bar{Y}\) is the fraction of days in her sample in which her commute was short.

Figure 2.8 shows the sampling distribution of \(\bar{Y}\) for various sample sizes n. When
n = 2 (Figure 2.8a), \(\bar{Y}\) can take on only three values: 0, 1/2, and 1 (neither commute was
short, one was short, and both were short), none of which is particularly close to the
true proportion in the population, 0.78. As n increases, however (Figures 2.8b-d),
\(\bar{Y}\) takes on more values, and the sampling distribution becomes tightly centered on \(\mu_Y\).
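
Because the sum of n Bernoulli draws has a binomial distribution, this tightening can be computed exactly rather than read off a figure. The following sketch is an illustration under stated assumptions: the function names, the band half-width c = 0.1, and the particular sample sizes other than n = 2 are choices made here, not values taken from the text. It lists the exact distribution of \(\bar{Y}\) for n = 2 and then reports \(\Pr(\mu_Y - c \le \bar{Y} \le \mu_Y + c)\) for several n.

```python
from math import comb

def sample_mean_pmf(n, p):
    """Exact distribution of the average of n i.i.d. Bernoulli(p) draws, as {value: probability}."""
    return {k / n: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

def prob_within(n, p, c):
    """Pr(p - c <= Ybar <= p + c), computed from the exact distribution."""
    return sum(
        prob
        for value, prob in sample_mean_pmf(n, p).items()
        if abs(value - p) <= c + 1e-12  # small tolerance for floating-point rounding
    )

p = 0.78  # probability of a short commute, from the text
print(sample_mean_pmf(2, p))  # three possible values: 0, 1/2, and 1
for n in (2, 5, 25, 100):     # illustrative sample sizes
    print(n, round(prob_within(n, p, 0.1), 3))
```

The reported probability rises toward 1 as n grows, which is the behavior the law of large numbers describes.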

The property that \(\bar{Y}\) is near \(\mu_Y\) with probability increasing to 1 as n increases is
called convergence in probability or, more concisely, consistency (see Key Concept 2.6).
The law of large numbers states that under certain conditions \(\bar{Y}\) converges
in probability to \(\mu_Y\) or, equivalently, that \(\bar{Y}\) is consistent for \(\mu_Y\).

The conditions for the law of large numbers that we will use in this text are that
\(Y_1, \dots, Y_n\) are i.i.d. and that the variance of \(Y_i\), \(\sigma_Y^2\), is finite. The mathematical role
of these conditions is made clear in Section 18.2, where the law of large numbers is
proven. If the data are collected by simple random sampling, then the i.i.d. assumption
holds. The assumption that the variance is finite says that extremely large values
of \(Y_i\) (that is, outliers) are unlikely and are observed infrequently; otherwise, these
large values could dominate \(\bar{Y}\), and the sample average would be unreliable. This
assumption is plausible for the applications in this text. For example, because there
is an upper limit to our student’s commuting time (she could park and walk if the
traffic is dreadful), the variance of the distribution of commuting times is finite.


The Central Limit Theorem 

The central limit theorem says that, under general conditions, the distribution of \(\bar{Y}\) is
well approximated by a normal distribution when n is large. Recall that the mean of
\(\bar{Y}\) is \(\mu_Y\) and its variance is \(\sigma_{\bar{Y}}^2 = \sigma_Y^2/n\). According to the central limit theorem, when
n is large, the distribution of \(\bar{Y}\) is approximately \(N(\mu_Y, \sigma_{\bar{Y}}^2)\). As discussed at the end
of Section 2.5, the distribution of \(\bar{Y}\) is exactly \(N(\mu_Y, \sigma_{\bar{Y}}^2)\) when the sample is drawn
from a population with the normal distribution \(N(\mu_Y, \sigma_Y^2)\). The central limit theorem
says that this same result is approximately true when n is large even if \(Y_1, \dots, Y_n\) are
not themselves normally distributed.

The convergence of the distribution of \(\bar{Y}\) to the bell-shaped, normal approximation
can be seen (a bit) in Figure 2.8. However, because the distribution gets quite
tight for large n, this requires some squinting. It would be easier to see the shape of
the distribution of \(\bar{Y}\) if you used a magnifying glass or had some other way to zoom
in or to expand the horizontal axis of the figure.

FIGURE 2.8 Sampling Distribution of the Sample Average of n Bernoulli Random Variables
The distributions are the sampling distributions of \(\bar{Y}\), the sample average of n independent Bernoulli random variables
with \(p = \Pr(Y_i = 1) = 0.78\) (the probability of a short commute is 78%). The variance of the sampling distribution
of \(\bar{Y}\) decreases as n gets larger, so the sampling distribution becomes more tightly concentrated around its mean,
\(\mu = 0.78\), as the sample size n increases.

FIGURE 2.9 Distribution of the Standardized Sample Average of n Bernoulli Random Variables
The sampling distributions of \(\bar{Y}\) in Figure 2.8 are plotted here after standardizing \(\bar{Y}\). Standardization centers the
distributions in Figure 2.8 and magnifies the scale on the horizontal axis by a factor of \(\sqrt{n}\). When the sample size is large,
the sampling distributions are increasingly well approximated by the normal distribution (the solid line), as predicted
by the central limit theorem. The normal distribution is scaled so that the height of the distribution is approximately
the same in all figures.

One way to do this is to standardize \(\bar{Y}\) so that it has a mean of 0 and a variance
of 1. This process leads to examining the distribution of the standardized version of
\(\bar{Y}\), \((\bar{Y} - \mu_Y)/\sigma_{\bar{Y}}\). According to the central limit theorem, this distribution should
be well approximated by a N(0, 1) distribution when n is large.

The distribution of the standardized average \((\bar{Y} - \mu_Y)/\sigma_{\bar{Y}}\) is plotted in
Figure 2.9 for the distributions in Figure 2.8; the distributions in Figure 2.9 are exactly
the same as in Figure 2.8, except that the scale of the horizontal axis is changed so
that the standardized variable has a mean of 0 and a variance of 1. After this change
of scale, it is easy to see that, if n is large enough, the distribution of \(\bar{Y}\) is well
approximated by a normal distribution.
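
The comparison made visually in Figure 2.9 can also be made numerically. The sketch below is an illustration under stated assumptions: the function names, the grid of points from -3 to 3, and the sample sizes are choices made here. For the Bernoulli case with p = 0.78, it computes the exact cumulative distribution of the standardized average and reports the largest gap between it and the standard normal cumulative distribution over the grid.

```python
from math import comb, erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def standardized_cdf(n, p, z):
    """Exact Pr((Ybar - mu_Y) / sigma_Ybar <= z) for the average of n Bernoulli(p) draws."""
    sigma_ybar = sqrt(p * (1 - p) / n)  # standard deviation of the sample average
    total = 0.0
    for k in range(n + 1):
        if (k / n - p) / sigma_ybar <= z:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

def largest_gap(n, p):
    """Largest absolute gap between the exact and N(0, 1) c.d.f. on a grid from -3 to 3."""
    grid = [z / 10 for z in range(-30, 31)]
    return max(abs(standardized_cdf(n, p, z) - normal_cdf(z)) for z in grid)

p = 0.78
for n in (2, 5, 25, 100):  # illustrative sample sizes
    print(n, round(largest_gap(n, p), 3))
```

A smaller gap means the N(0, 1) distribution is a better approximation; the gap should be much smaller for n = 100 than for n = 2.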

One might ask, how large is “large enough”? That is, how large must n be for the
distribution of \(\bar{Y}\) to be approximately normal? The answer is, “It depends.” The quality
of the normal approximation depends on the distribution of the underlying \(Y_i\) that
make up the average. At one extreme, if the \(Y_i\) are themselves normally distributed,
then \(\bar{Y}\) is exactly normally distributed for all n. In contrast, when the underlying \(Y_i\)
themselves have a distribution that is far from normal, then this approximation can
require n = 30 or even more.

This point is illustrated in Figure 2.10 for a population distribution, shown in
Figure 2.10a, that is quite different from the Bernoulli distribution. This distribution has a
long right tail (it is skewed to the right). The sampling distribution of \(\bar{Y}\), after centering
and scaling, is shown in Figures 2.10b-d for n = 5, 25, and 100, respectively. Although
the sampling distribution is approaching the bell shape for n = 25, the normal
approximation still has noticeable imperfections. By n = 100, however, the normal
approximation is quite good. In fact, for \(n \ge 100\), the normal approximation to the distribution
of \(\bar{Y}\) typically is very good for a wide variety of population distributions.
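
A simulation along the same lines can be sketched as follows. The population in Figure 2.10a is not specified numerically in the text, so the sketch substitutes an exponential population, which also has a long right tail; that substitution, the seed, the number of replications, and the function names are all assumptions made for illustration. The sketch measures the skewness of the simulated sample averages for n = 5, 25, and 100; a normal distribution has skewness 0.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def skewness_of_sample_average(n, reps, rng):
    """Skewness of `reps` simulated sample averages of n i.i.d. exponential(1) draws."""
    averages = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]
    return skewness(averages)

rng = random.Random(7)
results = {n: skewness_of_sample_average(n, 20000, rng) for n in (5, 25, 100)}
for n, skew in results.items():
    print(n, round(skew, 2))  # skewness shrinks toward 0 as n grows
```

For this assumed population the skewness of the sample average is still clearly positive at n = 25 and much closer to 0 at n = 100, in line with the pattern described for Figure 2.10.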

The central limit theorem is a remarkable result. While the “small n” distributions
of \(\bar{Y}\) in parts b and c of Figures 2.9 and 2.10 are complicated and quite different
from each other, the “large n” distributions in Figures 2.9d and 2.10d are simple and,
amazingly, have a similar shape. Because the distribution of \(\bar{Y}\) approaches the normal
as n grows large, \(\bar{Y}\) is said to have an asymptotic normal distribution.

The convenience of the normal approximation, combined with its wide applicability
because of the central limit theorem, makes it a key underpinning of applied
econometrics. The central limit theorem is summarized in Key Concept 2.7.


Key Concept 2.7: The Central Limit Theorem

Suppose that \(Y_1, \dots, Y_n\) are i.i.d. with \(E(Y_i) = \mu_Y\) and \(\operatorname{var}(Y_i) = \sigma_Y^2\), where
\(0 < \sigma_Y^2 < \infty\). As \(n \to \infty\), the distribution of \((\bar{Y} - \mu_Y)/\sigma_{\bar{Y}}\) (where \(\sigma_{\bar{Y}}^2 = \sigma_Y^2/n\))
becomes arbitrarily well approximated by the standard normal distribution.

FIGURE 2.10 Distribution of the Standardized Sample Average of n Draws from a Skewed
Population Distribution
The figures show sampling distributions of the standardized sample average of n draws from the skewed (asymmetric)
population distribution shown in Figure 2.10a. When n is small (n = 5), the sampling distribution, like the population
distribution, is skewed. But when n is large (n = 100), the sampling distribution is well approximated by a standard
normal distribution (solid line), as predicted by the central limit theorem. The normal distribution is scaled so that the
height of the distribution is approximately the same in all figures.
Summary

1. The probabilities with which a random variable takes on different values are
   summarized by the cumulative distribution function, the probability distribution
   function (for discrete random variables), and the probability density function
   (for continuous random variables).

2. The expected value of a random variable Y (also called its mean, \(\mu_Y\)),
   denoted E(Y), is its probability-weighted average value. The variance of Y is
   \(\sigma_Y^2 = E[(Y - \mu_Y)^2]\), and the standard deviation of Y is the square root of its
   variance.

3. The joint probabilities for two random variables, X and Y, are summarized by their
   joint probability distribution. The conditional probability distribution of Y given
   X = x is the probability distribution of Y, conditional on X taking on the value x.

4. A normally distributed random variable has the bell-shaped probability density
   in Figure 2.5. To calculate a probability associated with a normal random variable,
   first standardize the variable, and then use the standard normal cumulative
   distribution tabulated in Appendix Table 1.

5. Simple random sampling produces n random observations, \(Y_1, \dots, Y_n\), that are
   independently and identically distributed (i.i.d.).

6. The sample average, \(\bar{Y}\), varies from one randomly chosen sample to the next and
   thus is a random variable with a sampling distribution. If \(Y_1, \dots, Y_n\) are i.i.d., then

   a. the sampling distribution of \(\bar{Y}\) has mean \(\mu_Y\) and variance \(\sigma_{\bar{Y}}^2 = \sigma_Y^2/n\);

   b. the law of large numbers says that \(\bar{Y}\) converges in probability to \(\mu_Y\); and

   c. the central limit theorem says that the standardized version of \(\bar{Y}\),
      \((\bar{Y} - \mu_Y)/\sigma_{\bar{Y}}\), has a standard normal distribution [N(0, 1) distribution]
      when n is large.


Key Terms 


outcomes (56) 

probability (56) 

sample space (56) 

event (56) 

discrete random variable (56) 

continuous random variable (56) 

probability distribution (56) 

cumulative probability distribution (57) 

cumulative distribution function 
(c.d.f.) (57) 

cumulative distribution (57) 

Bernoulli random variable (58) 

Bernoulli distribution (58) 


probability density function (p.d.f.) (58) 
density function (58) 

density (58) 

expected value (60) 
expectation (60) 

mean (60) 

variance (61) 

standard deviation (61) 
moments of a distribution (63) 
skewness (64) 

kurtosis (64) 

outlier (64) 

leptokurtic (64)

\(r^{\text{th}}\) moment (65)

standardized random variable (65) 
joint probability distribution (65) 
marginal probability distribution (66) 
conditional distribution (66) 
conditional expectation (67) 
conditional mean (67) 

law of iterated expectations (68) 
conditional variance (69) 

Bayes’ rule (69) 

independently distributed (70) 
independent (70) 

covariance (70) 

correlation (71) 

uncorrelated (71) 

normal distribution (75) 

standard normal distribution (75) 
multivariate normal distribution (77) 
bivariate normal distribution (77) 


chi-squared distribution (80) 

Student t distribution (80)

t distribution (80) 

F distribution (80) 

simple random sampling (81) 

population (82) 

identically distributed (82) 

independently and identically 
distributed (i.i.d.) (82) 

sample average (82) 

sample mean (82) 

sampling distribution (83) 

exact (finite-sample) distribution (85) 

asymptotic distribution (85) 

law of large numbers (85) 

convergence in probability (86) 

consistency (86) 

central limit theorem (86) 

asymptotic normal distribution (89)

2.1 Examples of random variables used in this chapter included (a) the sex of
    the next person you meet, (b) the number of times a wireless network fails,
    (c) the time it takes to commute to school, and (d) whether it is raining or not.
    Explain why each can be thought of as random.

2.2 Suppose that the random variables X and Y are independent and you know
    their distributions. Explain why knowing the value of X tells you nothing
    about the value of Y.

2.3 Suppose that X denotes the amount of rainfall in your hometown during a
    randomly selected month and Y denotes the number of children born in Los
    Angeles during the same month. Are X and Y independent? Explain.


2.4 A math class has 100 students, and the mean student weight is 65 kg. A
    random sample of five students is selected from the class, and their average
    weight is calculated. Will the average weight of the students in the sample
    equal 65 kg? Why or why not? Use this example to explain why the sample
    average, \(\bar{Y}\), is a random variable.

2.5 Suppose that \(Y_1, \dots, Y_n\) are i.i.d. random variables with a N(2, 6) distribution.
    Sketch the probability density of \(\bar{Y}\) when n = 2. Repeat this for n = 15 and
    n = 200. Describe how the densities differ. What is the relationship between
    your answers and the law of large numbers?

2.6 Suppose that \(Y_1, \dots, Y_n\) are i.i.d. random variables with probability distribution
    given in Figure 2.10a. You want to calculate \(\Pr(\bar{Y} < 0.2)\). Would it be
    reasonable to use the normal approximation if n = 8? How about when n = 30
    and n = 150? Explain.

2.7 Y is a random variable with \(\mu_Y = 0\), \(\sigma_Y = 1\), skewness = 0, and kurtosis = 90.
    Sketch a hypothetical probability distribution of Y. Explain why n random
    variables drawn from this distribution might have some large outliers.


Exercises 


2.1 Let Y denote the number of “heads” that occur when two coins are tossed.
    Assume the probability of a heads is 0.4 on either coin.

    a. Derive the probability distribution of Y.

    b. Derive the mean and variance of Y.

2.2 Use the probability distribution given in Table 2.2 to compute (a) E(Y) and
    E(X); (b) \(\sigma_X^2\) and \(\sigma_Y^2\); and (c) \(\sigma_{XY}\) and corr(X, Y).

2.3 Using the random variables X and Y from Table 2.2, consider two new random
    variables, W = 4 + 8X and V = 11 − 2Y. Compute (a) E(W) and E(V);
    (b) \(\sigma_W^2\) and \(\sigma_V^2\); and (c) \(\sigma_{WV}\) and corr(W, V).

2.4 Suppose X is a Bernoulli random variable with Pr(X = 1) = p.

    a. Show \(E(X^3) = p\).

    b. Show \(E(X^k) = p\) for k > 0.

    c. Suppose that p = 0.53. Compute the mean, variance, skewness, and
       kurtosis of X. (Hint: You might find it helpful to use the formulas given in
       Exercise 2.21.)
2.5 In July, the daily high temperature in Lugano, a city in Switzerland, has a mean of
    65°F and a standard deviation of 5°F. What are the mean, standard deviation,
    and variance in degrees Celsius?

2.6 The following table gives the joint probability distribution between employment
    status and college graduation among those either employed or looking
    for work (unemployed) in the working-age population of South Africa.

                                 Unemployed    Employed
                                 (Y = 0)       (Y = 1)       Total
    Non-college grads (X = 0)    0.078         0.673         0.751
    College grads (X = 1)        0.042         0.207         0.249
    Total                        0.120         0.880         1.000

    a. Compute E(Y).

    b. The unemployment rate is the fraction of the labor force that is
       unemployed. Show that the unemployment rate is given by 1 − E(Y).

    c. Calculate E(Y|X = 1) and E(Y|X = 0).

    d. Calculate the unemployment rate for (i) college graduates and (ii)
       non-college graduates.

    e. A randomly selected member of this population reports being
       unemployed. What is the probability that this worker is a college graduate?
       A non-college graduate?

    f. Are educational achievement and employment status independent?
       Explain.

2.7 In a given population of two-earner male-female couples, male earnings have
    a mean of $50,000 per year and a standard deviation of $15,000. Female
    earnings have a mean of $48,000 per year and a standard deviation of $13,000.
    The correlation between male and female earnings for a couple is 0.90. Let C
    denote the combined earnings for a randomly selected couple.

    a. What is the mean of C?

    b. What is the covariance between male and female earnings?

    c. What is the standard deviation of C?

    d. Convert the answers to (a) through (c) from U.S. dollars ($) to euros (€).

2.8 The random variable Y has a mean of 4 and a variance of 1/9. Let Z = 3(Y − 4).
    Find the mean and the variance of Z.
2.9 X and Y are discrete random variables with the following joint distribution:


Value of Y 


Value of X 6 0.10 0.06 0.15 0.03 0.02 


That is, Pr(X = 3, Y = 2) = 0.04, and so forth. 


a. Calculate the probability distribution, mean, and variance of Y. 


b. Calculate the probability distribution, mean, and variance of Y given 
X = 6. 


c. Calculate the covariance and correlation between X and Y. 

2.10 Compute the following probabilities:

    a. If Y is distributed N(4, 9), find \(\Pr(Y \le 5)\).

    b. If Y is distributed N(5, 16), find \(\Pr(Y > 2)\).

    c. If Y is distributed N(1, 4), find \(\Pr(2 \le Y \le 5)\).

    d. If Y is distributed N(2, 1), find \(\Pr(1 \le Y \le 4)\).


2.11 Compute the following probabilities:

    a. If Y is distributed \(\chi^2_3\), find \(\Pr(Y \le 6.25)\).

    b. If Y is distributed \(\chi^2_8\), find \(\Pr(Y \le 15.51)\).

    c. If Y is distributed \(F_{8,\infty}\), find \(\Pr(Y \le 1.94)\).

    d. Why are the answers to (b) and (c) the same?

    e. If Y is distributed \(\chi^2_1\), find \(\Pr(Y \le 0.5)\). (Hint: Use the definition of the
       \(\chi^2_1\) distribution.)


2.12 Compute the following probabilities: 


. If Y is distributed t,,, find Pr( Y = 1.36). 

. If Y is distributed f39, find Pr(—1.70 = Y = 1.70). 

c. If Y is distributed N(0, 1), find \(\Pr(-1.70 \le Y \le 1.70)\).

d. When do the critical values of the normal and the t distribution coincide? 
e. If Y is distributed F, 11, find Pr( Y > 3.36). 

f. If Y is distributed F, 51, find Pr( Y > 4.87). 




2.13 X is a Bernoulli random variable with Pr(X = 1) = 0.90; Y is distributed
    N(0, 4); W is distributed N(0, 16); and X, Y, and W are independent.
    Let S = XY + (1 − X)W. (That is, S = Y when X = 1, and S = W when
    X = 0.)

    a. Show that \(E(Y^2) = 4\) and \(E(W^2) = 16\).

    b. Show that \(E(Y^3) = 0\) and \(E(W^3) = 0\). (Hint: What is the skewness for
       a symmetric distribution?)

    c. Show that \(E(Y^4) = 3 \times 4^2\) and \(E(W^4) = 3 \times 16^2\). (Hint: Use the fact
       that the kurtosis is 3 for a normal distribution.)

    d. Derive \(E(S)\), \(E(S^2)\), \(E(S^3)\), and \(E(S^4)\). (Hint: Use the law of iterated
       expectations conditioning on X = 0 and X = 1.)

    e. Derive the skewness and kurtosis for S.


2.14 In a population, \(\mu_Y = 50\) and \(\sigma_Y^2 = 21\). Use the central limit theorem to
    answer the following questions:

    a. In a random sample of size n = 50, find \(\Pr(\bar{Y} < 51)\).

    b. In a random sample of size n = 150, find \(\Pr(\bar{Y} > 49)\).

    c. In a random sample of size n = 45, find \(\Pr(50.5 < \bar{Y} < 51)\).


2.15 Suppose \(Y_i\), \(i = 1, 2, \dots, n\), are i.i.d. random variables, each distributed
    N(20, 4).

    a. Compute \(\Pr(19.6 < \bar{Y} < 20.4)\) when (i) n = 25, (ii) n = 100, and
       (iii) n = 800.

    b. Suppose c is a positive number. Show that \(\Pr(20 - c < \bar{Y} < 20 + c)\)
       becomes close to 1.0 as n grows large.

    c. Use your answer in (b) to argue that \(\bar{Y}\) converges in probability to 20.


2.16 Y is distributed N(10, 100) and you want to calculate \(\Pr(Y \le 5.8)\).
    Unfortunately, you do not have your textbook, and do not have access to a normal
    probability table like Appendix Table 1. However, you do have your computer
    and a computer program that can generate i.i.d. draws from the N(10, 100)
    distribution. Explain how you can use your computer to compute an accurate
    approximation for \(\Pr(Y \le 5.8)\).


2.17 \(Y_i\), \(i = 1, \dots, n\), are i.i.d. Bernoulli random variables with p = 0.6. Let \(\bar{Y}\)
    denote the sample mean.

    a. Use the central limit theorem to compute approximations for
       i. \(\Pr(\bar{Y} \ge 0.64)\) when n = 50.
       ii. \(\Pr(\bar{Y} < 0.56)\) when n = 200.

    b. How large would n need to be to ensure that \(\Pr(0.65 > \bar{Y} > 0.55) \ge 0.95\)?
       (Use the central limit theorem to compute an approximate answer.)


2.18 In any year, the weather can inflict storm damage to a home. From year to
    year, the damage is random. Let Y denote the dollar value of damage in any
    given year. Suppose that in 95% of the years Y = $0, but in 5% of the years
    Y = $30,000.

    a. What are the mean and standard deviation of the damage in any year?

    b. Consider an “insurance pool” of 120 people whose homes are sufficiently
       dispersed so that, in any year, the damage to different homes can be
       viewed as independently distributed random variables. Let \(\bar{Y}\) denote the
       average damage to these 120 homes in a year. (i) What is the expected
       value of the average damage \(\bar{Y}\)? (ii) What is the probability that \(\bar{Y}\)
       exceeds $3,000?


2.19 Consider two random variables, X and Y. Suppose that Y takes on k values
    \(y_1, \dots, y_k\) and that X takes on l values \(x_1, \dots, x_l\).

    a. Show that \(\Pr(Y = y_j) = \sum_{i=1}^{l} \Pr(Y = y_j \mid X = x_i)\Pr(X = x_i)\). [Hint:
       Use the definition of \(\Pr(Y = y_j \mid X = x_i)\).]

    b. Use your answer to (a) to verify Equation (2.19).

    c. Suppose that X and Y are independent. Show that \(\sigma_{XY} = 0\) and
       corr(X, Y) = 0.


2.20 Consider three random variables, X, Y, and Z. Suppose that Y takes on k
    values \(y_1, \dots, y_k\); that X takes on l values \(x_1, \dots, x_l\); and that Z takes
    on m values \(z_1, \dots, z_m\). The joint probability distribution of X, Y, Z is
    Pr(X = x, Y = y, Z = z), and the conditional probability distribution of Y
    given X and Z is

    \[
    \Pr(Y = y \mid X = x, Z = z) = \frac{\Pr(Y = y, X = x, Z = z)}{\Pr(X = x, Z = z)}.
    \]

    a. Explain how the marginal probability that Y = y can be calculated
       from the joint probability distribution. [Hint: This is a generalization of
       Equation (2.16).]

    b. Show that E(Y) = E[E(Y|X, Z)]. [Hint: This is a generalization of
       Equations (2.19) and (2.20).]


2.21 X is a random variable with moments \(E(X)\), \(E(X^2)\), \(E(X^3)\), and so forth.

    a. Show \(E(X - \mu)^3 = E(X^3) - 3[E(X^2)][E(X)] + 2[E(X)]^3\).

    b. Show
       \(E(X - \mu)^4 = E(X^4) - 4[E(X)][E(X^3)] + 6[E(X)]^2[E(X^2)] - 3[E(X)]^4\).


2.22 Suppose you have some money to invest, for simplicity $1, and you are planning
    to put a fraction w into a stock market mutual fund and the rest, 1 − w,
into a mutual fund. Suppose that $1 invested in a stock fund yields R, after 
one year and that $1 invested in mutual fund yields. R,. Suppose that R, is 
random with mean 0.06 and standard deviation 0.09, and suppose that R, is 
random with mean 0.04 and standard deviation 0.05. The correlation between 
R, and œR; is 0.3. If you place a fraction w of your money in the stock fund 
and the rest, 1 — w,in the mutual fund, then the return on your investment is 
R= wk, + (1 — w)R. 


a. Suppose that w = 0.2. Compute the mean and standard deviation of R.

b. Suppose that w = 0.8. Compute the mean and standard deviation of R.

c. What value of w makes the mean of R as large as possible? What is the
standard deviation of R for this value of w?

d. (Harder) What is the value of w that minimizes the standard deviation
of R? (Show using a graph, algebra, or calculus.)


2.23 This exercise provides an example of a pair of random variables, X and Y, for
which the conditional mean of Y given X depends on X but corr(X, Y) = 0.
Let X and Z be two independently distributed standard normal random variables,
and let \(Y = X^2 + Z\).

a. Show that \(E(Y \mid X) = X^2\).

b. Show that \(\mu_Y = 1\).

c. Show that E(XY) = 0. (Hint: Use the fact that the odd moments of a
standard normal random variable are all 0.)

d. Show that cov(X, Y) = 0 and thus corr(X, Y) = 0.


2.24 Suppose \(Y_i\) is distributed i.i.d. \(N(0, \sigma^2)\) for i = 1, 2, ..., n.

a. Show that \(E(Y_i^2/\sigma^2) = 1\).

b. Show that \(W = (1/\sigma^2)\sum_{i=1}^{n} Y_i^2\) is distributed \(\chi^2_n\).

c. Show that E(W) = n. [Hint: Use your answer to (a).]

d. Show that \(V = Y_1 \Big/ \sqrt{\sum_{i=2}^{n} Y_i^2/(n-1)}\) is distributed \(t_{n-1}\).


2.25 (Review of summation notation) Let \(x_1, \ldots, x_n\) denote a sequence of numbers;
\(y_1, \ldots, y_n\) denote another sequence of numbers; and a, b, and c denote
three constants. Show that

a. \(\sum_{i=1}^{n} a x_i = a\sum_{i=1}^{n} x_i\),

b. \(\sum_{i=1}^{n} (x_i + y_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i\),

c. \(\sum_{i=1}^{n} a = n \times a\), and

d. \(\sum_{i=1}^{n} (a + bx_i + cy_i)^2 = na^2 + b^2\sum_{i=1}^{n} x_i^2 + c^2\sum_{i=1}^{n} y_i^2 + 2ab\sum_{i=1}^{n} x_i + 2ac\sum_{i=1}^{n} y_i + 2bc\sum_{i=1}^{n} x_i y_i\).


2.26 Suppose that \(Y_1, Y_2, \ldots, Y_n\) are random variables with a common mean \(\mu_Y\);
a common variance \(\sigma_Y^2\); and the same correlation \(\rho\) (so that the correlation
between \(Y_i\) and \(Y_j\) is equal to \(\rho\) for all pairs i and j, where i ≠ j).

a. Show that \(\operatorname{cov}(Y_i, Y_j) = \rho\sigma_Y^2\) for i ≠ j.

b. Suppose that n = 2. Show that \(E(\bar{Y}) = \mu_Y\) and
\(\operatorname{var}(\bar{Y}) = \tfrac{1}{2}\sigma_Y^2 + \tfrac{1}{2}\rho\sigma_Y^2\).

c. For n ≥ 2, show that \(E(\bar{Y}) = \mu_Y\) and
\(\operatorname{var}(\bar{Y}) = \sigma_Y^2/n + [(n-1)/n]\rho\sigma_Y^2\).

d. When n is very large, show that \(\operatorname{var}(\bar{Y}) \approx \rho\sigma_Y^2\).


2.27 Consider the problem of predicting Y using another variable, X, so that the
prediction of Y is some function of X, say g(X). Suppose that the quality of
the prediction is measured by the squared prediction error made on average
over all predictions, that is, by \(E\{[Y - g(X)]^2\}\). This exercise provides a
non-calculus proof that of all possible prediction functions g, the best prediction
is made by the conditional expectation, E(Y|X).

a. Let \(\hat{Y} = E(Y \mid X)\), and let \(u = Y - \hat{Y}\) denote its prediction error. Show
that E(u) = 0. (Hint: Use the law of iterated expectations.)

b. Show that E(uX) = 0.

c. Let \(\tilde{Y} = g(X)\) denote a different prediction of Y using X, and let
\(v = Y - \tilde{Y}\) denote its error. Show that \(E[(Y - \tilde{Y})^2] > E[(Y - \hat{Y})^2]\).
[Hint: Let \(h(X) = g(X) - E(Y \mid X)\), so that \(v = [Y - E(Y \mid X)] - h(X)\).
Derive \(E(v^2)\).]


2.28 Refer to Part B of Table 2.3 for the conditional distribution of the number of 
network failures M given network age A. Let Pr(A = 0) = 0.5; that is, you 
work in your room 50% of the time. 


a. Compute the probability of three network failures, Pr( M = 3). 
b. Use Bayes’ rule to compute Pr(A = 0| M = 3). 


c. Now suppose you work in your room one-fourth of the time, so 
Pr(A = 0) = 0.25. Use Bayes’ rule to compute Pr(A = 0|M = 3). 


Empirical Exercise 


E2.1 On the text website, http://www.pearsonglobaleditions.com, you will find the 
spreadsheet Age_HourlyEarnings, which contains the joint distribution of 
age (Age) and average hourly earnings (AHE) for 25- to 34-year-old full-time 
workers in 2015 with an education level that exceeds a high school diploma. 
Use this joint distribution to carry out the following exercises. (Note: For these 
exercises, you need to be able to carry out calculations and construct charts 
using a spreadsheet.) 


a. Compute the marginal distribution of Age. 

b. Compute the mean of AHE for each value of Age; that is, compute, 
E(AHE|Age = 25), and so forth. 

c. Compute and plot the mean of AHE versus Age. Are average hourly 
earnings and age related? Explain.

d. Use the law of iterated expectations to compute the mean of AHE; that
is, compute E(AHE).

e. Compute the variance of AHE.

f. Compute the covariance between AHE and Age.

g. Compute the correlation between AHE and Age.

h. Relate your answers in (f) and (g) to the plot you constructed in (c).


APPENDIX 


2.1 Derivation of Results in Key Concept 2.3 


This appendix derives the equations in Key Concept 2.3. 

Equation (2.30) follows from the definition of the expectation. 

To derive Equation (2.31), use the definition of the variance to write

\[
\operatorname{var}(a + bY) = E\{[a + bY - E(a + bY)]^2\} = E\{[b(Y - \mu_Y)]^2\}
= b^2 E[(Y - \mu_Y)^2] = b^2\sigma_Y^2.
\]


To derive Equation (2.32), use the definition of the variance to write

\[
\begin{aligned}
\operatorname{var}(aX + bY) &= E\{[(aX + bY) - (a\mu_X + b\mu_Y)]^2\} \\
&= E\{[a(X - \mu_X) + b(Y - \mu_Y)]^2\} \\
&= E[a^2(X - \mu_X)^2] + 2E[ab(X - \mu_X)(Y - \mu_Y)] + E[b^2(Y - \mu_Y)^2] \\
&= a^2\operatorname{var}(X) + 2ab\operatorname{cov}(X, Y) + b^2\operatorname{var}(Y) \\
&= a^2\sigma_X^2 + 2ab\sigma_{XY} + b^2\sigma_Y^2,
\end{aligned}
\tag{2.50}
\]

where the second equality follows by collecting terms, the third equality follows by expanding
the quadratic, and the fourth equality follows by the definition of the variance and covariance.
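
Equation (2.50) can also be checked numerically. The following is a minimal sketch, not part of the derivation: it assumes a small, made-up joint distribution for X and Y (the dictionary `joint` and both function names are hypothetical) and compares var(aX + bY) computed directly from the definition with the right-hand side of Equation (2.50).

```python
# Hypothetical joint distribution: joint[(x, y)] = Pr(X = x, Y = y).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}


def expect(f):
    """Return E[f(X, Y)] under the joint distribution above."""
    return sum(p * f(x, y) for (x, y), p in joint.items())


def var_linear_direct(a, b):
    """var(aX + bY) computed directly from the definition of the variance."""
    mean = expect(lambda x, y: a * x + b * y)
    return expect(lambda x, y: (a * x + b * y - mean) ** 2)


def var_linear_formula(a, b):
    """a^2 var(X) + 2ab cov(X, Y) + b^2 var(Y), the last line of Equation (2.50)."""
    mu_x = expect(lambda x, y: x)
    mu_y = expect(lambda x, y: y)
    var_x = expect(lambda x, y: (x - mu_x) ** 2)
    var_y = expect(lambda x, y: (y - mu_y) ** 2)
    cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))
    return a * a * var_x + 2 * a * b * cov_xy + b * b * var_y


if __name__ == "__main__":
    # The two columns should agree up to floating-point rounding.
    for a, b in [(1, 1), (2, -3), (0.5, 4)]:
        print(a, b, var_linear_direct(a, b), var_linear_formula(a, b))
```

Any other valid joint distribution could be substituted for `joint`; the agreement does not depend on the particular probabilities chosen here.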


To derive Equation (2.33), write

\[
E(Y^2) = E\{[(Y - \mu_Y) + \mu_Y]^2\} = E[(Y - \mu_Y)^2] + 2\mu_Y E(Y - \mu_Y) + \mu_Y^2
= \sigma_Y^2 + \mu_Y^2
\]

because \(E(Y - \mu_Y) = 0\).
To derive Equation (2.34), use the definition of the covariance to write

\[
\begin{aligned}
\operatorname{cov}(a + bX + cV, Y) &= E\{[a + bX + cV - E(a + bX + cV)][Y - \mu_Y]\} \\
&= E\{[b(X - \mu_X) + c(V - \mu_V)][Y - \mu_Y]\} \\
&= E\{[b(X - \mu_X)][Y - \mu_Y]\} + E\{[c(V - \mu_V)][Y - \mu_Y]\} \\
&= b\sigma_{XY} + c\sigma_{VY},
\end{aligned}
\tag{2.51}
\]

which is Equation (2.34).

To derive Equation (2.35), write

\[
\begin{aligned}
E(XY) &= E\{[(X - \mu_X) + \mu_X][(Y - \mu_Y) + \mu_Y]\} \\
&= E[(X - \mu_X)(Y - \mu_Y)] + \mu_X E(Y - \mu_Y) + \mu_Y E(X - \mu_X) + \mu_X\mu_Y \\
&= \sigma_{XY} + \mu_X\mu_Y.
\end{aligned}
\]


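We now prove the correlation inequality in Equation (2.36); that is, \(|\operatorname{corr}(X, Y)| \le 1\).

Let \(a = -\sigma_{XY}/\sigma_X^2\) and b = 1. Applying Equation (2.32), we have

\[
\begin{aligned}
\operatorname{var}(aX + Y) &= a^2\sigma_X^2 + \sigma_Y^2 + 2a\sigma_{XY} \\
&= (-\sigma_{XY}/\sigma_X^2)^2\sigma_X^2 + \sigma_Y^2 + 2(-\sigma_{XY}/\sigma_X^2)\sigma_{XY} \\
&= \sigma_Y^2 - \sigma_{XY}^2/\sigma_X^2.
\end{aligned}
\tag{2.52}
\]

Because var(aX + Y) is a variance, it cannot be negative, so from the final line of
Equation (2.52), it must be that \(\sigma_Y^2 - \sigma_{XY}^2/\sigma_X^2 \ge 0\). Rearranging this inequality yields

\[
\sigma_{XY}^2 \le \sigma_X^2\sigma_Y^2 \quad \text{(covariance inequality)}. \tag{2.53}
\]

The covariance inequality implies that \(\sigma_{XY}^2/(\sigma_X^2\sigma_Y^2) \le 1\) or, equivalently,
\(|\sigma_{XY}/(\sigma_X\sigma_Y)| \le 1\), which (using the definition of the correlation) proves the
correlation inequality, \(|\operatorname{corr}(X, Y)| \le 1\).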


The Conditional Mean as the Minimum 
Mean Squared Error Predictor 


At a general level, the statistical prediction problem is, how does one best use the information 
in a random variable X to predict the value of another random variable Y? 

To answer this question, we must first make precise mathematically what it means for
one prediction to be better than another. A common way to do so is to consider the cost of 
making a prediction error. This cost, which is called the prediction loss, depends on the mag- 
nitude of the prediction error. For example, if your job is to predict sales so that a production 
supervisor can develop a production schedule, being off by a small amount is unlikely to 
inconvenience customers or to disrupt the production process. But if you are off by a large 
amount and production is set far too low, your company might lose customers who need to 
wait a long time to receive a product they order, or if production is far too high, the company 
will have costly excess inventory on its hands. In either case, a large prediction error can be 


disproportionately more costly than a small one.

One way to make this logic precise is to let the cost of a prediction error depend on the
square of that error, so an error twice as large is four times as costly. Specifically, suppose that
your prediction of Y, given the random variable X, is g(X). The prediction error is Y − g(X),
and the quadratic loss associated with this prediction is

\[
\text{Loss} = E\{[Y - g(X)]^2\}. \tag{2.54}
\]

We now show that, of all possible functions g(X), the loss in Equation (2.54) is minimized
by g(X) = E(Y|X). We show this result using discrete random variables; however, this result
extends to continuous random variables. The proof here uses calculus; Exercise 2.27 works
through a non-calculus proof of this result.

First consider the simpler problem of finding a number, m, that minimizes \(E[(Y - m)^2]\).
From the definition of the expectation, \(E[(Y - m)^2] = \sum_{i=1}^{k} (y_i - m)^2 p_i\). To find the value
of m that minimizes \(E[(Y - m)^2]\), take the derivative of \(\sum_{i=1}^{k} (y_i - m)^2 p_i\) with respect to
m and set it to zero:

\[
\frac{d}{dm}\sum_{i=1}^{k} (y_i - m)^2 p_i
= -2\sum_{i=1}^{k} (y_i - m)p_i
= -2\left(\sum_{i=1}^{k} y_i p_i - m\sum_{i=1}^{k} p_i\right)
= -2\left(\sum_{i=1}^{k} y_i p_i - m\right) = 0,
\tag{2.55}
\]

where the third equality uses the fact that probabilities sum to 1. It follows from the final equality in
Equation (2.55) that the squared error prediction loss is minimized by \(m = \sum_{i=1}^{k} y_i p_i = E(Y)\),
that is, by setting m equal to the mean of Y.

To find the predictor g(X) that minimizes the loss in Equation (2.54), use the law of iterated
expectations to write that loss as \(\text{Loss} = E\{[Y - g(X)]^2\} = E\big(E\{[Y - g(X)]^2 \mid X\}\big)\).
Thus, if the function g(X) minimizes \(E\{[Y - g(X)]^2 \mid X = x\}\) for each value of x, it
minimizes the loss in Equation (2.54). But for a fixed value X = x, g(X) = g(x) is a fixed number,
so this problem is the same as the one just solved, and the loss is minimized by choosing g(x)
to be the mean of Y, given X = x. This is true for every value of x. Thus the squared error loss
in Equation (2.54) is minimized by g(X) = E(Y|X).
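
The result can be illustrated with a small numerical example. The sketch below is only an illustration under assumed values: the joint distribution `joint` and the two competing predictors (`unconditional_mean` and `shifted`) are hypothetical and chosen for convenience. It evaluates the quadratic loss in Equation (2.54) for the conditional mean and for the two alternatives.

```python
# Hypothetical joint distribution: joint[(x, y)] = Pr(X = x, Y = y).
joint = {(0, 1): 0.10, (0, 3): 0.30, (1, 2): 0.20, (1, 6): 0.40}


def conditional_mean(x):
    """E(Y | X = x), computed from the joint distribution."""
    prob_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return sum(p * y for (xi, y), p in joint.items() if xi == x) / prob_x


def unconditional_mean(_x):
    """A predictor that ignores X and always predicts E(Y)."""
    return sum(p * y for (_, y), p in joint.items())


def shifted(x):
    """The conditional mean plus an arbitrary offset of 0.5."""
    return conditional_mean(x) + 0.5


def loss(g):
    """Quadratic loss E{[Y - g(X)]^2} for a prediction function g."""
    return sum(p * (y - g(x)) ** 2 for (x, y), p in joint.items())


if __name__ == "__main__":
    for name, g in [("E(Y|X)", conditional_mean),
                    ("E(Y)", unconditional_mean),
                    ("E(Y|X) + 0.5", shifted)]:
        print(name, loss(g))
```

In this example the conditional mean has the smallest loss of the three predictors, and shifting it by a constant h raises the loss by exactly h squared, which matches the decomposition suggested in the hint to Exercise 2.27.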


Review of Statistics 


Statistics is the science of using data to learn about the world around us. Statistical
tools help us answer questions about unknown characteristics of distributions in 
populations of interest. For example, what is the mean of the distribution of earnings 
of recent college graduates? Do mean earnings differ for men and women and, if so, 
by how much? 

These questions relate to the distribution of earnings in the population of workers. 
One way to answer these questions would be to perform an exhaustive survey of the 
population of workers, measuring the earnings of each worker and thus finding the 
population distribution of earnings. In practice, however, such a comprehensive survey 
would be extremely expensive. Comprehensive surveys that do exist, also known as 
censuses, are often undertaken periodically (for example, every ten years in India, the 
United States of America and the United Kingdom). This is because the process of con- 
ducting a census is an extraordinary commitment, consisting of designing census 
forms, managing and conducting surveys, and compiling and analyzing data. Censuses 
across the world have a long history, with accounts of censuses recorded by Babylonians
in 4000 BC. According to historians, censuses have been conducted as far back as
Ancient Rome; the Romans would track the population by making people return to
their birthplace every year in order to be counted.¹ In England and parts of
Wales, a notable census was the Domesday Book, which was compiled in 1086 by
William the Conqueror. The U.K. census in its current form dates back to 1801 after
essays by economist Thomas Malthus (1798) inspired parliament to want to accurately
know the size of the population. Over time the census has evolved from amounting to
a mere headcount to the much more ambitious survey of the 2011 U.K. census costing
an estimated £482 million. In India, there are accounts of censuses recorded around
300 BC, but the census in its current form has been undertaken since 1872 and every
ten years since 1881. In comparison to the U.K. census of 2011, the most recent census
of India, also conducted in 2011, cost a mere ₹2200 crore (approximately US$320 million).
Despite the considerable efforts made to ensure that the census records all individuals, 
many people slip through the cracks and are not surveyed. Thus a different, more 
practical approach is needed. 

The key insight of statistics is that one can learn about a population distribution by 
selecting a random sample from that population. Rather than survey the entire population
of China (1.4 billion in 2018), we might survey, say, 1000 members of the population,
selected at random by simple random sampling. Using statistical methods, we
can use this sample to reach tentative conclusions (that is, to draw statistical inferences)
about characteristics of the full population.²

¹Source: Office for National Statistics, https://www.ons.gov.uk, accessed on August 23, 2018.

Three types of statistical methods are used throughout econometrics: estimation, 
hypothesis testing, and confidence intervals. Estimation entails computing a “best 
guess” numerical value for an unknown characteristic of a population distribution, 
such as its mean, from a sample of data. Hypothesis testing entails formulating a 
specific hypothesis about the population and then using sample evidence to decide 
whether it is true. Confidence intervals use a set of data to estimate an interval or 
range for an unknown population characteristic. Sections 3.1, 3.2, and 3.3 review 
estimation, hypothesis testing, and confidence intervals in the context of statistical 
inference about an unknown population mean. 

Most of the interesting questions in economics involve relationships between two or 
more variables or comparisons between different populations. For example, is there a gap 
between the mean earnings for male and female recent college graduates? In Section 3.4, 
the methods for learning about the mean of a single population in Sections 3.1 through 
3.3 are extended to compare means in two different populations. Section 3.5 discusses 
how the methods for comparing the means of two populations can be used to estimate 
causal effects in experiments. Sections 3.2 through 3.5 focus on the use of the normal dis- 
tribution for performing hypothesis tests and for constructing confidence intervals when 
the sample size is large. In some special circumstances, hypothesis tests and confidence 
intervals can be based on the Student t distribution instead of the normal distribution; 
these special circumstances are discussed in Section 3.6. The chapter concludes with a 
discussion of the sample correlation and scatterplots in Section 3.7. 


Estimation of the Population Mean 


Suppose you want to know the mean value of Y (that is, \(\mu_Y\)) in a population, such as
the mean earnings of women recently graduated from college. A natural way to estimate
this mean is to compute the sample average \(\bar{Y}\) from a sample of n independently
and identically distributed (i.i.d.) observations, \(Y_1, \ldots, Y_n\) (recall that
\(Y_1, \ldots, Y_n\) are i.i.d. if they are collected by simple random sampling). This section
discusses estimation of \(\mu_Y\) and the properties of \(\bar{Y}\) as an estimator of \(\mu_Y\).


Estimators and Their Properties 


Estimators. The sample average \(\bar{Y}\) is a natural way to estimate \(\mu_Y\), but it is not the
only way. For example, another way to estimate \(\mu_Y\) is simply to use the first
observation, \(Y_1\). Both \(\bar{Y}\) and \(Y_1\) are functions of the data that are designed to estimate
\(\mu_Y\); using the terminology in Key Concept 3.1, both are estimators of \(\mu_Y\). When
evaluated in repeated samples, \(\bar{Y}\) and \(Y_1\) take on different values (they produce
different estimates) from one sample to the next. Thus the estimators \(\bar{Y}\) and \(Y_1\) both
have sampling distributions. There are, in fact, many estimators of \(\mu_Y\), of which \(\bar{Y}\) and
\(Y_1\) are two examples.

Key Concept 3.1: Estimators and Estimates

An estimator is a function of a sample of data to be drawn randomly from a population.
An estimate is the numerical value of the estimator when it is actually computed
using data from a specific sample. An estimator is a random variable because
of randomness in selecting the sample, while an estimate is a nonrandom number.

²Estimates of the ‘live’ population of China can be found using the ‘official’ China Population Clock:
http://data.stats.gov.cn/english/

There are many possible estimators, so what makes one estimator “better” than 
another? Because estimators are random variables, this question can be phrased 
more precisely: What are desirable characteristics of the sampling distribution of an 
estimator? In general, we would like an estimator that gets as close as possible to the 
unknown true value, at least in some average sense; in other words, we would like the 
sampling distribution of an estimator to be as tightly centered on the unknown value 
as possible. This observation leads to three specific desirable characteristics of an 
estimator: unbiasedness (a lack of bias), consistency, and efficiency. 


Unbiasedness. Suppose you evaluate an estimator many times over repeated randomly
drawn samples. It is reasonable to hope that, on average, you would get the
right answer. Thus a desirable property of an estimator is that the mean of its sampling
distribution equals \(\mu_Y\); if so, the estimator is said to be unbiased.

To state this concept mathematically, let \(\hat{\mu}_Y\) denote some estimator of \(\mu_Y\), such
as \(\bar{Y}\) or \(Y_1\). [The caret (^) notation will be used throughout this text to denote an
estimator, so \(\hat{\mu}_Y\) is an estimator of \(\mu_Y\).] The estimator \(\hat{\mu}_Y\) is unbiased if \(E(\hat{\mu}_Y) = \mu_Y\),
where \(E(\hat{\mu}_Y)\) is the mean of the sampling distribution of \(\hat{\mu}_Y\); otherwise, \(\hat{\mu}_Y\) is biased.


Key Concept 3.2: Bias, Consistency, and Efficiency

Let \(\hat{\mu}_Y\) be an estimator of \(\mu_Y\). Then:

• The bias of \(\hat{\mu}_Y\) is \(E(\hat{\mu}_Y) - \mu_Y\).
• \(\hat{\mu}_Y\) is an unbiased estimator of \(\mu_Y\) if \(E(\hat{\mu}_Y) = \mu_Y\).
• \(\hat{\mu}_Y\) is a consistent estimator of \(\mu_Y\) if \(\hat{\mu}_Y \xrightarrow{p} \mu_Y\).
• Let \(\tilde{\mu}_Y\) be another estimator of \(\mu_Y\), and suppose that both \(\hat{\mu}_Y\) and \(\tilde{\mu}_Y\) are
unbiased. Then \(\hat{\mu}_Y\) is said to be more efficient than \(\tilde{\mu}_Y\) if \(\operatorname{var}(\hat{\mu}_Y) < \operatorname{var}(\tilde{\mu}_Y)\).


Consistency. Another desirable property of an estimator \(\hat{\mu}_Y\) is that when the sample
size is large, the uncertainty about the value of \(\mu_Y\) arising from random variations in
the sample is very small. Stated more precisely, a desirable property of \(\hat{\mu}_Y\) is that the
probability that it is within a small interval of the true value \(\mu_Y\) approaches 1 as the
sample size increases; that is, \(\hat{\mu}_Y\) is consistent for \(\mu_Y\) (Key Concept 2.6).


Variance and efficiency. Suppose you have two candidate estimators, \(\hat{\mu}_Y\) and \(\tilde{\mu}_Y\),
both of which are unbiased. How might you choose between them? One way to do
so is to choose the estimator with the tightest sampling distribution. This suggests
choosing between \(\hat{\mu}_Y\) and \(\tilde{\mu}_Y\) by picking the estimator with the smallest variance. If
\(\hat{\mu}_Y\) has a smaller variance than \(\tilde{\mu}_Y\), then \(\hat{\mu}_Y\) is said to be more efficient than \(\tilde{\mu}_Y\). The
terminology “efficiency” stems from the notion that if \(\hat{\mu}_Y\) has a smaller variance than
\(\tilde{\mu}_Y\), then it uses the information in the data more efficiently than does \(\tilde{\mu}_Y\).

Bias, consistency, and efficiency are summarized in Key Concept 3.2.


Properties of \(\bar{Y}\)


How does \(\bar{Y}\) fare as an estimator of \(\mu_Y\) when judged by the three criteria of bias,
consistency, and efficiency?


Bias and consistency. The sampling distribution of \(\bar{Y}\) has already been examined in
Sections 2.5 and 2.6. As shown in Section 2.5, \(E(\bar{Y}) = \mu_Y\), so \(\bar{Y}\) is an unbiased
estimator of \(\mu_Y\). Similarly, the law of large numbers (Key Concept 2.6) states that
\(\bar{Y} \xrightarrow{p} \mu_Y\); that is, \(\bar{Y}\) is consistent.


Efficiency. What can be said about the efficiency of \(\bar{Y}\)? Because efficiency entails a
comparison of estimators, we need to specify the estimator or estimators to which \(\bar{Y}\)
is to be compared.

We start by comparing the efficiency of \(\bar{Y}\) to the estimator \(Y_1\). Because \(Y_1, \ldots, Y_n\)
are i.i.d., the mean of the sampling distribution of \(Y_1\) is \(E(Y_1) = \mu_Y\); thus \(Y_1\) is an
unbiased estimator of \(\mu_Y\). Its variance is \(\operatorname{var}(Y_1) = \sigma_Y^2\). From Section 2.5, the
variance of \(\bar{Y}\) is \(\sigma_Y^2/n\). Thus, for n ≥ 2, the variance of \(\bar{Y}\) is less than the variance of \(Y_1\);
that is, \(\bar{Y}\) is a more efficient estimator than \(Y_1\), so, according to the criterion of
efficiency, \(\bar{Y}\) should be used instead of \(Y_1\). The estimator \(Y_1\) might strike you as an
obviously poor estimator (why would you go to the trouble of collecting a sample of
n observations only to throw away all but the first?), and the concept of efficiency
provides a formal way to show that \(\bar{Y}\) is a more desirable estimator than \(Y_1\).

What about a less obviously poor estimator? Consider the weighted average in
which the observations are alternately weighted by 1/2 and 3/2:

\[
\tilde{Y} = \frac{1}{n}\left(\tfrac{1}{2}Y_1 + \tfrac{3}{2}Y_2 + \tfrac{1}{2}Y_3 + \tfrac{3}{2}Y_4 + \cdots + \tfrac{1}{2}Y_{n-1} + \tfrac{3}{2}Y_n\right),
\tag{3.1}
\]

where the number of observations n is assumed to be even for convenience. The
mean of \(\tilde{Y}\) is \(\mu_Y\), and its variance is \(\operatorname{var}(\tilde{Y}) = 1.25\sigma_Y^2/n\) (Exercise 3.11). Thus \(\tilde{Y}\) is
unbiased, and because \(\operatorname{var}(\tilde{Y}) \to 0\) as \(n \to \infty\), \(\tilde{Y}\) is consistent. However, \(\tilde{Y}\) has a
larger variance than \(\bar{Y}\). Thus \(\bar{Y}\) is more efficient than \(\tilde{Y}\).

Key Concept 3.3: Efficiency of \(\bar{Y}\): \(\bar{Y}\) Is BLUE

Let \(\hat{\mu}_Y\) be an estimator of \(\mu_Y\) that is a weighted average of \(Y_1, \ldots, Y_n\); that is,
\(\hat{\mu}_Y = (1/n)\sum_{i=1}^{n} a_i Y_i\), where \(a_1, \ldots, a_n\) are nonrandom constants. If \(\hat{\mu}_Y\) is
unbiased, then \(\operatorname{var}(\bar{Y}) < \operatorname{var}(\hat{\mu}_Y)\) unless \(\hat{\mu}_Y = \bar{Y}\). Thus \(\bar{Y}\) is the Best Linear
Unbiased Estimator (BLUE); that is, \(\bar{Y}\) is the most efficient estimator of \(\mu_Y\) among all
unbiased estimators that are weighted averages of \(Y_1, \ldots, Y_n\).

The estimators \(\bar{Y}\), \(Y_1\), and \(\tilde{Y}\) have a common mathematical structure: They are
weighted averages of \(Y_1, \ldots, Y_n\). The comparisons in the previous two paragraphs
show that the weighted averages \(Y_1\) and \(\tilde{Y}\) have larger variances than \(\bar{Y}\). In fact, these
conclusions reflect a more general result: \(\bar{Y}\) is the most efficient estimator of all
unbiased estimators that are weighted averages of \(Y_1, \ldots, Y_n\). Said differently, \(\bar{Y}\) is
the Best Linear Unbiased Estimator (BLUE); that is, it is the most efficient (best)
estimator among all estimators that are unbiased and are linear functions of
\(Y_1, \ldots, Y_n\). This result is stated in Key Concept 3.3 and is proved in Chapter 5.
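
The variance comparisons among the sample average, the first observation, and the alternately weighted average can be illustrated by simulation. The sketch below is a rough illustration under stated assumptions, not a proof: it assumes normally distributed i.i.d. draws with hypothetical values mu = 5 and sigma = 1, a sample size of n = 10, and a fixed random seed; the function name `simulate_estimators` is made up for this example.

```python
import random
import statistics


def simulate_estimators(n=10, reps=20000, mu=5.0, sigma=1.0, seed=0):
    """Simulate the sampling mean and variance of three estimators of mu.

    The estimators are the sample average, the first observation, and the
    weighted average of Equation (3.1) with alternating weights 1/2 and 3/2.
    The distribution and all default values are assumptions for illustration.
    """
    if n % 2 != 0:
        raise ValueError("n must be even for the alternating weights")
    rng = random.Random(seed)
    weights = [0.5 if i % 2 == 0 else 1.5 for i in range(n)]
    draws = {"ybar": [], "y1": [], "ytilde": []}
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        draws["ybar"].append(sum(sample) / n)
        draws["y1"].append(sample[0])
        draws["ytilde"].append(sum(w * y for w, y in zip(weights, sample)) / n)
    # Summarize each simulated sampling distribution by its mean and variance.
    return {name: (statistics.fmean(vals), statistics.pvariance(vals))
            for name, vals in draws.items()}


if __name__ == "__main__":
    for name, (mean, var) in simulate_estimators().items():
        print(f"{name}: mean = {mean:.3f}, variance = {var:.4f}")
```

Under these assumptions the theoretical variances are 0.1 for the sample average, 1 for the first observation, and 0.125 for the weighted average, and all three simulated means should be close to 5, since each estimator is unbiased.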


\(\bar{Y}\) is the least squares estimator of \(\mu_Y\). The sample average \(\bar{Y}\) provides the best fit to
the data in the sense that the average squared differences between the observations
and \(\bar{Y}\) are the smallest of all possible estimators.

Consider the problem of finding the estimator m that minimizes

\[
\sum_{i=1}^{n} (Y_i - m)^2, \tag{3.2}
\]

which is a measure of the total squared gap or distance between the estimator m and
the sample points. Because m is an estimator of E(Y), you can think of it as a prediction
of the value of Y, so the gap \(Y_i - m\) can be thought of as a prediction mistake.
The sum of squared gaps in Expression (3.2) can be thought of as the sum of squared
prediction mistakes.
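
Expression (3.2) is easy to evaluate numerically for any candidate value of m. The sketch below assumes a small made-up sample (the list `sample` and both function names are hypothetical) and searches a grid of candidate values for the one with the smallest sum of squared gaps, so that the result can be compared with the sample average.

```python
def sum_squared_gaps(sample, m):
    """Expression (3.2): the sum of squared gaps between the observations and m."""
    return sum((y - m) ** 2 for y in sample)


def grid_minimizer(sample, step=0.01):
    """Search candidate values of m between the sample minimum and maximum.

    This is a crude grid search, so the answer is only accurate to within
    about one grid step.
    """
    low, high = min(sample), max(sample)
    n_steps = int(round((high - low) / step))
    candidates = [low + k * step for k in range(n_steps + 1)]
    return min(candidates, key=lambda m: sum_squared_gaps(sample, m))


if __name__ == "__main__":
    sample = [12.0, 15.5, 9.0, 20.0, 18.5]  # hypothetical observations
    print("grid search minimizer:", grid_minimizer(sample))
    print("sample average:       ", sum(sample) / len(sample))
```

The grid search and the sample average should agree up to the grid spacing; a finer `step` narrows the gap at the cost of more evaluations.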

The estimator m that minimizes the sum of squared gaps \(Y_i - m\) in Expression (3.2)
is called the least squares estimator. One can imagine using trial and error to solve
the least squares problem: Try many values of m until you are satisfied that you have
the value that makes Expression (3.2) as small as possible. Alternatively, as is done
in Appendix 3.2, you can use algebra or calculus to show that choosing \(m = \bar{Y}\)
minimizes the sum of squared gaps in Expression (3.2), so that \(\bar{Y}\) is the least squares
estimator of \(\mu_Y\).


Off the Mark!

In 2009, India’s general elections, also referred to as the national elections, were the
largest democratic election in the world until the Indian general elections of 2014, held
from April 7, 2014. Shortly before the general elections, pollsters predicted a close fight
between the coalition parties: the United Progressive Alliance (UPA) and the National
Democratic Alliance (NDA). Psephologists envisaged that while the UPA might have had
the upper hand, the NDA could not be written off. They predicted that the UPA would get
between 201 and 235 seats in the 15th Lok Sabha (the lower house of India’s bicameral
Parliament) and the NDA between 165 and 186 seats. The actual results were surprising:
UPA got 262 seats, while NDA could only manage to get 157 seats.

What could be the possible reasons for opinion polls being wide of the mark? In countries
that do not have a homogeneous population, such as India, caste, religion, and geographies
influence electoral outcomes greatly. Vulnerable sections of the population may be afraid
to disclose their actual preference. Political polls have since become much more
sophisticated and adjust for sampling bias, but they still can make mistakes. If opinion
polls do not randomly select samples across various locations and sections of people, they
may still not hit the mark.

Source: Atul Thakur, “Why Opinion Polls Are Often Wide off the Mark,” The Times of India,
April 13, 2014.


The Importance of Random Sampling 


We have assumed that \(Y_1, \ldots, Y_n\) are i.i.d. draws, such as those that would be
obtained from simple random sampling. This assumption is important because nonrandom
sampling can result in \(\bar{Y}\) being biased. Suppose that to estimate the monthly
national unemployment rate, a statistical agency adopts a sampling scheme in which 
interviewers survey working-age adults sitting in city parks at 10 a.m. on the second 
Wednesday of the month. Because most employed people are at work at that hour 
(not sitting in the park!), the unemployed are overly represented in the sample, and 
an estimate of the unemployment rate based on this sampling plan would be biased. 
This bias arises because this sampling scheme overrepresents, or oversamples, the 
unemployed members of the population. This example is fictitious, but the 
“Off the Mark!” box gives a real-world example of biases introduced by sampling 
that is not entirely random. 
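
The park example can be mimicked with a small simulation. The sketch below rests entirely on made-up assumptions: a population of 100,000 people with a true unemployment rate of 5%, and a sampling scheme in which an unemployed person is ten times as likely as an employed person to be found in the park. The function name and all parameter values are hypothetical.

```python
import random


def simulate_sampling(pop_size=100_000, true_rate=0.05, n=1000,
                      park_weight_unemployed=10.0, seed=1):
    """Compare a simple random sample with a sample that oversamples the unemployed.

    Returns (population rate, simple-random-sample estimate, park-sample estimate).
    All parameter values are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    # 1 = unemployed, 0 = employed.
    population = [1 if rng.random() < true_rate else 0 for _ in range(pop_size)]
    # Simple random sampling: every person is equally likely to be drawn.
    srs = rng.sample(population, n)
    # "Park" sampling: unemployed people are more likely to be interviewed.
    # random.choices draws with replacement, which is adequate for this sketch.
    weights = [park_weight_unemployed if u == 1 else 1.0 for u in population]
    park = rng.choices(population, weights=weights, k=n)
    return sum(population) / pop_size, sum(srs) / n, sum(park) / n


if __name__ == "__main__":
    pop_rate, srs_rate, park_rate = simulate_sampling()
    print("population unemployment rate:", pop_rate)
    print("simple random sample estimate:", srs_rate)
    print("park sample estimate:", park_rate)
```

With these assumed weights the park sample should report an unemployment rate several times the population rate, while the simple random sample should land close to it. Collecting more park interviews would not remove the discrepancy, because the bias comes from how the sample is selected rather than from its size.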

It is important to design sample selection schemes in a way that minimizes bias. 
Appendix 3.1 includes a discussion of what the Bureau of Labor Statistics actually 
does when it conducts the U.S. Current Population Survey (CPS), the survey it uses 
to estimate the monthly U.S. unemployment rate.


Hypothesis Tests Concerning
the Population Mean


Many hypotheses about the world around us can be phrased as yes/no questions. 
Do the mean hourly earnings of recent U.S. college graduates equal $20 per hour? Are 
mean earnings the same for male and female college graduates? Both these questions 
embody specific hypotheses about the population distribution of earnings. The statisti- 
cal challenge is to answer these questions based on a sample of evidence. This section 
describes hypothesis tests concerning the population mean (Does the population 
mean of hourly earnings equal $20?). Hypothesis tests involving two populations (Are 
mean earnings the same for men and women?) are taken up in Section 3.4. 


Null and Alternative Hypotheses 


The starting point of statistical hypothesis testing is specifying the hypothesis to be
tested, called the null hypothesis. Hypothesis testing entails using data to compare 
the null hypothesis to a second hypothesis, called the alternative hypothesis, that 
holds if the null does not. 

The null hypothesis is that the population mean, E(Y), takes on a specific value,
denoted \(\mu_{Y,0}\). The null hypothesis is denoted \(H_0\) and thus is

\[
H_0\colon E(Y) = \mu_{Y,0}. \tag{3.3}
\]


For example, the conjecture that, on average in the population, college graduates 
earn $20 per hour constitutes a null hypothesis about the population distribution of 
hourly earnings. Stated mathematically, if Y is the hourly earnings of a randomly 
selected recent college graduate, then the null hypothesis is that E(Y) = 20; that is, 
\(\mu_{Y,0} = 20\) in Equation (3.3).

The alternative hypothesis specifies what is true if the null hypothesis is not. The
most general alternative hypothesis is that \(E(Y) \ne \mu_{Y,0}\), which is called a two-sided
alternative hypothesis because it allows E(Y) to be either less than or greater than
\(\mu_{Y,0}\). The two-sided alternative is written as

\[
H_1\colon E(Y) \ne \mu_{Y,0} \quad \text{(two-sided alternative)}. \tag{3.4}
\]


One-sided alternatives are also possible, and these are discussed later in this 
section. 

The problem facing the statistician is to use the evidence in a randomly selected 
sample of data to decide whether to accept the null hypothesis \(H_0\) or to reject it in
favor of the alternative hypothesis \(H_1\). If the null hypothesis is “accepted,” this does
not mean that the statistician declares it to be true; rather, it is accepted tentatively 
with the recognition that it might be rejected later based on additional evidence. For 
this reason, statistical hypothesis testing can be posed as either rejecting the null 
hypothesis or failing to do so.


The p-Value


In any given sample, the sample average \(\bar{Y}\) will rarely be exactly equal to the hypothesized
value \(\mu_{Y,0}\). Differences between \(\bar{Y}\) and \(\mu_{Y,0}\) can arise because the true mean, in
fact, does not equal \(\mu_{Y,0}\) (the null hypothesis is false) or because the true mean equals
\(\mu_{Y,0}\) (the null hypothesis is true) but \(\bar{Y}\) differs from \(\mu_{Y,0}\) because of random sampling.
It is impossible to distinguish between these two possibilities with certainty. Although
a sample of data cannot provide conclusive evidence about the null hypothesis, it is
possible to do a probabilistic calculation that permits testing the null hypothesis in a
way that accounts for sampling uncertainty. This calculation involves using the data
to compute the p-value of the null hypothesis.

The p-value, also called the significance probability, is the probability of drawing
a statistic at least as adverse to the null hypothesis as the one you actually computed
in your sample, assuming the null hypothesis is correct. In the case at hand, the
p-value is the probability of drawing \(\bar{Y}\) at least as far in the tails of its distribution
under the null hypothesis as the sample average you actually computed.

For example, suppose that, in your sample of recent college graduates, the average
wage is $22.64. The p-value is the probability of observing a value of \(\bar{Y}\) at least as
different from $20 (the population mean under the null hypothesis) as the observed
value of $22.64 by pure random sampling variation, assuming that the null hypothesis 
is true. If this p-value is small (say, 0.1%), then it is very unlikely that this sample 
would have been drawn if the null hypothesis is true; thus it is reasonable to conclude 
that the null hypothesis is not true. By contrast, if this p-value is large (say, 40%), then 
it is quite likely that the observed sample average of $22.64 could have arisen just by 
random sampling variation if the null hypothesis is true; accordingly, the evidence 
against the null hypothesis is weak in this probabilistic sense, and it is reasonable not 
to reject the null hypothesis. 

To state the definition of the p-value mathematically, let \(\bar{Y}^{act}\) denote the value of
the sample average actually computed in the data set at hand, and let \(\Pr_{H_0}\) denote the
probability computed under the null hypothesis (that is, computed assuming that
\(E(Y) = \mu_{Y,0}\)). The p-value is

\[
\text{p-value} = \Pr\nolimits_{H_0}\left[\,|\bar{Y} - \mu_{Y,0}| > |\bar{Y}^{act} - \mu_{Y,0}|\,\right]. \tag{3.5}
\]

That is, the p-value is the area in the tails of the distribution of \(\bar{Y}\) under the null
hypothesis beyond \(\mu_{Y,0} \pm |\bar{Y}^{act} - \mu_{Y,0}|\). If the p-value is large, then the observed
value \(\bar{Y}^{act}\) is consistent with the null hypothesis, but if the p-value is small, it is not.

To compute the p-value, it is necessary to know the sampling distribution of \(\bar{Y}\)
under the null hypothesis. As discussed in Section 2.6, when the sample size is small,
this distribution is complicated. However, according to the central limit theorem,
when the sample size is large, the sampling distribution of \(\bar{Y}\) is well approximated by
a normal distribution. Under the null hypothesis the mean of this normal distribution
is \(\mu_{Y,0}\), so under the null hypothesis \(\bar{Y}\) is distributed \(N(\mu_{Y,0}, \sigma_{\bar{Y}}^2)\), where \(\sigma_{\bar{Y}}^2 = \sigma_Y^2/n\).
This large-sample normal approximation makes it possible to compute the p-value
without needing to know the population distribution of Y, as long as the sample size
is large. The details of the calculation, however, depend on whether \(\sigma_Y^2\) is known.


Calculating the p-Value When \(\sigma_Y\) Is Known


The calculation of the p-value when \(\sigma_Y\) is known is summarized in Figure 3.1. If the
sample size is large, then under the null hypothesis the sampling distribution of \(\bar{Y}\) is
\(N(\mu_{Y,0}, \sigma_{\bar{Y}}^2)\), where \(\sigma_{\bar{Y}}^2 = \sigma_Y^2/n\). Thus, under the null hypothesis, the standardized
version of \(\bar{Y}\), \((\bar{Y} - \mu_{Y,0})/\sigma_{\bar{Y}}\), has a standard normal distribution. The p-value is the
probability of obtaining a value of \(\bar{Y}\) farther from \(\mu_{Y,0}\) than \(\bar{Y}^{act}\) under the null
hypothesis or, equivalently, it is the probability of obtaining \((\bar{Y} - \mu_{Y,0})/\sigma_{\bar{Y}}\) greater
than \((\bar{Y}^{act} - \mu_{Y,0})/\sigma_{\bar{Y}}\) in absolute value. This probability is the shaded area shown
in Figure 3.1. Written mathematically, the shaded tail probability in Figure 3.1 (that
is, the p-value) is

\[ \text{p-value} = \Pr_{H_0}\left( \left| \frac{\bar{Y} - \mu_{Y,0}}{\sigma_{\bar{Y}}} \right| > \left| \frac{\bar{Y}^{act} - \mu_{Y,0}}{\sigma_{\bar{Y}}} \right| \right) = 2\Phi\left( -\left| \frac{\bar{Y}^{act} - \mu_{Y,0}}{\sigma_{\bar{Y}}} \right| \right), \qquad (3.6) \]

where \(\Phi\) is the standard normal cumulative distribution function. That is, the p-value
is the area in the tails of a standard normal distribution outside \(\pm |\bar{Y}^{act} - \mu_{Y,0}| / \sigma_{\bar{Y}}\).

The formula for the p-value in Equation (3.6) depends on the variance of the
population distribution, \(\sigma_Y^2\). In practice, this variance is typically unknown. [An
exception is when \(Y_i\) is binary, so that its distribution is Bernoulli, in which case the
variance is determined by the null hypothesis; see Equation (2.7) and Exercise 3.2.]
Because in general \(\sigma_Y^2\) must be estimated before the p-value can be computed, we
now turn to the problem of estimating \(\sigma_Y^2\).
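
As a minimal sketch of the known-variance calculation in Equation (3.6), the Python snippet below uses the binary case mentioned above, where the null hypothesis pins down the variance. The function name `p_value_known_sigma` and the numbers (a null success probability of 0.5, an observed fraction of 0.60, and n = 100) are illustrative assumptions, not values from the text.

```python
from math import sqrt
from statistics import NormalDist


def p_value_known_sigma(ybar_act, mu_0, sigma_y, n):
    """Two-sided large-sample p-value of Equation (3.6) when sigma_Y is known."""
    sigma_ybar = sigma_y / sqrt(n)         # standard deviation of the sample average
    z = (ybar_act - mu_0) / sigma_ybar     # standardized sample average under the null
    return 2 * NormalDist().cdf(-abs(z))   # area in both tails of N(0, 1)


# Hypothetical binary example: under the null, p = 0.5, so sigma_Y^2 = 0.5 * (1 - 0.5).
p_value = p_value_known_sigma(ybar_act=0.60, mu_0=0.50, sigma_y=sqrt(0.25), n=100)
print(f"p-value = {p_value:.4f}")
```

Here the standardized sample average is 2, so the p-value is the area of the standard normal distribution outside plus or minus 2, roughly 4.6%.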


[Figure 3.1: Calculating a p-value]

The Sample Variance, Sample Standard Deviation, and Standard Error


The sample variance, \(s_Y^2\), is an estimator of the population variance, \(\sigma_Y^2\); the sample
standard deviation, \(s_Y\), is an estimator of the population standard deviation, \(\sigma_Y\); and
the standard error of the sample average, \(\bar{Y}\), is an estimator of the standard deviation
of the sampling distribution of \(\bar{Y}\).


The sample variance and standard deviation. The sample variance, \(s_Y^2\), is

\[ s_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2. \qquad (3.7) \]

The sample standard deviation, \(s_Y\), is the square root of the sample variance.

The formula for the sample variance is much like the formula for the population
variance. The population variance, \(E(Y - \mu_Y)^2\), is the average value of \((Y - \mu_Y)^2\)
in the population distribution. Similarly, the sample variance is the sample average
of \((Y_i - \mu_Y)^2\), \(i = 1, \ldots, n\), with two modifications: First, \(\mu_Y\) is replaced by \(\bar{Y}\), and
second, the average uses the divisor \(n - 1\) instead of \(n\).

The reason for the first modification (replacing \(\mu_Y\) by \(\bar{Y}\)) is that \(\mu_Y\) is unknown
and thus must be estimated; the natural estimator of \(\mu_Y\) is \(\bar{Y}\). The reason for
the second modification (dividing by \(n - 1\) instead of by \(n\)) is that estimating \(\mu_Y\)
by \(\bar{Y}\) introduces a small downward bias in \((Y_i - \bar{Y})^2\). Specifically, as is shown
in Exercise 3.18, \(E[(Y_i - \bar{Y})^2] = [(n - 1)/n]\sigma_Y^2\). Thus
\(E\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = nE[(Y_i - \bar{Y})^2] = (n - 1)\sigma_Y^2\). Dividing by \(n - 1\) in Equation (3.7)
instead of \(n\) corrects for this small downward bias, and as a result \(s_Y^2\) is unbiased.


Dividing by n — 1 in Equation (3.7) instead of n is called a degrees of freedom 
correction: Estimating the mean uses up some of the information—that is, uses up 1 
“degree of freedom” —in the data, so that only n — 1 degrees of freedom remain. 
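
The effect of the degrees of freedom correction can be checked exactly in a tiny case. The sketch below enumerates every possible sample of n = 3 i.i.d. draws that are 0 or 1 with probability 1/2 each (a setup chosen here purely for illustration) and averages the two candidate estimators over all samples. The helper name `sum_sq_dev` is hypothetical.

```python
from itertools import product


def sum_sq_dev(sample):
    """Sum of squared deviations of the observations from their sample average."""
    ybar = sum(sample) / len(sample)
    return sum((y - ybar) ** 2 for y in sample)


n = 3
# All 2**n equally likely samples of n i.i.d. draws, each 0 or 1 with probability 1/2.
samples = list(product([0, 1], repeat=n))
population_variance = 0.5 * (1 - 0.5)  # variance of a Bernoulli(0.5) variable

# Expected value of each estimator, obtained by averaging over every possible sample.
mean_divide_by_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
mean_divide_by_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(population_variance, mean_divide_by_n, mean_divide_by_n_minus_1)
```

In this example the estimator that divides by n averages to (n - 1)/n times the population variance, while the estimator that divides by n - 1 averages to the population variance itself.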


Consistency of the sample variance. The sample variance is a consistent estimator 
of the population variance: 


\[ s_Y^2 \xrightarrow{\;p\;} \sigma_Y^2. \qquad (3.8) \]


In other words, the sample variance is close to the population variance with high 
probability when n is large. 

The result in Equation (3.8) is proven in Appendix 3.3 under the assumptions
that \(Y_1, \ldots, Y_n\) are i.i.d. and \(Y_i\) has a finite fourth moment; that is, \(E(Y_i^4) < \infty\).
Intuitively, the reason that \(s_Y^2\) is consistent is that it is a sample average, so \(s_Y^2\) obeys
the law of large numbers. For \(s_Y^2\) to obey the law of large numbers in Key Concept 2.6,
\((Y_i - \mu_Y)^2\) must have finite variance, which in turn means that \(E(Y_i^4)\) must be finite;
in other words, \(Y_i\) must have a finite fourth moment.

KEY CONCEPT 3.4
The Standard Error of \(\bar{Y}\)

The standard error of \(\bar{Y}\) is an estimator of the standard deviation of \(\bar{Y}\). The
standard error of \(\bar{Y}\) is denoted \(SE(\bar{Y})\) or \(\hat{\sigma}_{\bar{Y}}\). When \(Y_1, \ldots, Y_n\) are i.i.d.,

\[ SE(\bar{Y}) = \hat{\sigma}_{\bar{Y}} = s_Y / \sqrt{n}. \qquad (3.9) \]


The standard error of \(\bar{Y}\). Because the standard deviation of the sampling distribution
of \(\bar{Y}\) is \(\sigma_{\bar{Y}} = \sigma_Y/\sqrt{n}\), Equation (3.8) justifies using \(s_Y/\sqrt{n}\) as an estimator of \(\sigma_{\bar{Y}}\).
The estimator of \(\sigma_{\bar{Y}}\), \(s_Y/\sqrt{n}\), is called the standard error of \(\bar{Y}\) and is denoted \(SE(\bar{Y})\)
or \(\hat{\sigma}_{\bar{Y}}\). The standard error of \(\bar{Y}\) is summarized in Key Concept 3.4.

When \(Y_1, \ldots, Y_n\) are i.i.d. draws from a Bernoulli distribution with success
probability p, the formula for the variance of \(\bar{Y}\) simplifies to \(p(1 - p)/n\) (see
Exercise 3.2). The formula for the standard error also takes on a simple form that
depends only on \(\bar{Y}\) and n: \(SE(\bar{Y}) = \sqrt{\bar{Y}(1 - \bar{Y})/n}\).


Calculating the p-Value When \(\sigma_Y\) Is Unknown


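Because \(s_Y^2\) is a consistent estimator of \(\sigma_Y^2\), the p-value can be computed by replacing
\(\sigma_{\bar{Y}}\) in Equation (3.6) by the standard error, \(SE(\bar{Y}) = \hat{\sigma}_{\bar{Y}}\). That is, when \(\sigma_Y\) is
unknown and \(Y_1, \ldots, Y_n\) are i.i.d., the p-value is calculated using the formula

\[ \text{p-value} = 2\Phi\left( -\left| \frac{\bar{Y}^{act} - \mu_{Y,0}}{SE(\bar{Y})} \right| \right). \qquad (3.10) \]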


The t-Statistic 


The standardized sample average \((\bar{Y} - \mu_{Y,0})/SE(\bar{Y})\) plays a central role in testing
statistical hypotheses and has a special name, the t-statistic or t-ratio:

\[ t = \frac{\bar{Y} - \mu_{Y,0}}{SE(\bar{Y})}. \qquad (3.11) \]

In general, a test statistic is a statistic used to perform a hypothesis test. The t-statistic
is an important example of a test statistic.


Large-sample distribution of the t-statistic. When n is large, \(s_Y^2\) is close to \(\sigma_Y^2\) with
high probability. Thus the distribution of the t-statistic is approximately the same as
the distribution of \((\bar{Y} - \mu_{Y,0})/\sigma_{\bar{Y}}\), which in turn is well approximated by the
standard normal distribution when n is large because of the central limit theorem
(Key Concept 2.7). Accordingly, under the null hypothesis,

\[ t \text{ is approximately distributed } N(0, 1) \text{ for large } n. \qquad (3.12) \]

The formula for the p-value in Equation (3.10) can be rewritten in terms of the
t-statistic. Let \(t^{act}\) denote the value of the t-statistic actually computed:

\[ t^{act} = \frac{\bar{Y}^{act} - \mu_{Y,0}}{SE(\bar{Y})}. \qquad (3.13) \]

Accordingly, when n is large, the p-value can be calculated using

\[ \text{p-value} = 2\Phi(-|t^{act}|). \qquad (3.14) \]


As a hypothetical example, suppose that a sample of n = 200 recent college
graduates is used to test the null hypothesis that the mean wage, \(E(Y)\), is $20 per
hour. The sample average wage is \(\bar{Y}^{act}\) = $22.64, and the sample standard deviation
is \(s_Y\) = $18.14. Then the standard error of \(\bar{Y}\) is \(s_Y/\sqrt{n} = 18.14/\sqrt{200} = 1.28\). The
value of the t-statistic is \(t^{act} = (22.64 - 20)/1.28 = 2.06\). From Appendix Table 1,
the p-value is \(2\Phi(-2.06) = 0.039\), or 3.9%. That is, assuming the null hypothesis to
be true, the probability of obtaining a sample average at least as different from the
null as the one actually computed is 3.9%.
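
The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch: the function name `t_test_mean` is hypothetical, and the standard normal distribution function comes from the standard library's `statistics.NormalDist` in place of Appendix Table 1.

```python
from math import sqrt
from statistics import NormalDist


def t_test_mean(ybar_act, s_y, n, mu_0):
    """Return (standard error, t-statistic, two-sided large-sample p-value)."""
    se = s_y / sqrt(n)                            # standard error of the sample average
    t_act = (ybar_act - mu_0) / se                # t-statistic, Equation (3.13)
    p_value = 2 * NormalDist().cdf(-abs(t_act))   # p-value, Equation (3.14)
    return se, t_act, p_value


# Numbers from the hypothetical wage example in the text.
se, t_act, p_value = t_test_mean(ybar_act=22.64, s_y=18.14, n=200, mu_0=20.0)
print(f"SE = {se:.2f}, t = {t_act:.2f}, p-value = {p_value:.3f}")
```

Because the code carries the unrounded standard error and t-statistic through the calculation, the final digit of the p-value can differ slightly from the 3.9% obtained in the text from the rounded value 2.06.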


Hypothesis Testing with a Prespecified Significance Level 


When you undertake a statistical hypothesis test, you can make two types of mistakes: 
You can incorrectly reject the null hypothesis when it is true, or you can fail to reject the 
null hypothesis when it is false. Hypothesis tests can be performed without computing 
the p-value if you are willing to specify in advance the probability you are willing to toler- 
ate of making the first kind of mistake — that is, of incorrectly rejecting the null hypoth- 
esis when it is true. If you choose a prespecified probability of rejecting the null hypothesis 
when it is true (for example, 5%), then you will reject the null hypothesis if and only if 
the p-value is less than 0.05. This approach gives preferential treatment to the null 
hypothesis, but in many practical situations, this preferential treatment is appropriate. 


Hypothesis tests using a fixed significance level. Suppose it has been decided
that the hypothesis will be rejected if the p-value is less than 5%. Because the area
under the tails of the standard normal distribution outside \(\pm 1.96\) is 5%, this gives a
simple rule:

\[ \text{Reject } H_0 \text{ if } |t^{act}| > 1.96. \qquad (3.15) \]


That is, reject if the absolute value of the t-statistic computed from the sample is 
greater than 1.96. If n is large enough, then under the null hypothesis the t-statistic 
has a N(0, 1) distribution. Thus the probability of erroneously rejecting the null 
hypothesis (rejecting the null hypothesis when it is, in fact, true) is 5%. 

This framework for testing statistical hypotheses has some specialized
terminology, summarized in Key Concept 3.5. The significance level of the test in
Equation (3.15) is 5%, the critical value of this two-sided test is 1.96, and the rejection
region is the values of the t-statistic outside \(\pm 1.96\). If the test rejects at the 5%
significance level, the population mean \(\mu_Y\) is said to be statistically significantly
different from \(\mu_{Y,0}\) at the 5% significance level.

KEY CONCEPT 3.5
The Terminology of Hypothesis Testing

A statistical hypothesis test can make two types of mistakes: a type I error,
in which the null hypothesis is rejected when in fact it is true; and a type II error, in
which the null hypothesis is not rejected when in fact it is false. The prespecified
rejection probability of a statistical hypothesis test when the null hypothesis is
true (that is, the prespecified probability of a type I error) is the significance
level of the test. The critical value of the test statistic is the value of the statistic
for which the test just rejects the null hypothesis at the given significance level.
The set of values of the test statistic for which the test rejects the null hypothesis
is the rejection region, and the set of values of the test statistic for which it does
not reject the null hypothesis is the acceptance region. The probability that the test
actually incorrectly rejects the null hypothesis when it is true is the size of the test,
and the probability that the test correctly rejects the null hypothesis when the
alternative is true is the power of the test.

The p-value is the probability of obtaining a test statistic, by random sampling
variation, at least as adverse to the null hypothesis value as is the statistic actually
observed, assuming that the null hypothesis is correct. Equivalently, the p-value is
the smallest significance level at which you can reject the null hypothesis.

Testing hypotheses using a prespecified significance level does not require 
computing p-values. In the previous example of testing the hypothesis that the mean 
earnings of recent college graduates is $20 per hour, the t-statistic was 2.06. This value 
exceeds 1.96, so the hypothesis is rejected at the 5% level. Although performing the test 
with a 5% significance level is easy, reporting only whether the null hypothesis is rejected 
at a prespecified significance level conveys less information than reporting the p-value. 


What significance level should you use in practice? This is a question of active
debate. Historically, statisticians and econometricians have used a 5% significance
level. If you were to test many statistical hypotheses at the 5% level, you would
incorrectly reject a true null, on average, once in 20 cases. Whether this is a small number
depends on how you look at it. If only a small fraction of all null hypotheses tested
are, in fact, false, then among those tests that reject, the probability of the null
actually being false can be small, so the probability that a rejection is incorrect can be
high (Exercise 3.22). This latter probability, the fraction of incorrect rejections among all
rejections, is called the false positive rate. The false positive
rate can have great practical importance. For example, for newly reported statistically
significant findings of effective medical treatments, it is the fraction for which the
treatment is in fact ineffective. Concern that the false positive rate can be high when
the 5% significance level is used has led some statisticians to recommend using
instead a 0.5% significance level when reporting new results (Benjamin et al., 2017).
Similar concerns can apply in a legal setting, where justice might aim to keep the
fraction of false convictions low. Using a 0.5% significance level leads to two-sided
rejection when the t-statistic exceeds 2.81 in absolute value. In such cases, a p-value
between 0.05 and 0.005 can be viewed as suggestive, but not conclusive, evidence
against the null that merits further investigation.

KEY CONCEPT 3.6
Testing the Hypothesis \(E(Y) = \mu_{Y,0}\) Against the Alternative \(E(Y) \neq \mu_{Y,0}\)

1. Compute the standard error of \(\bar{Y}\), \(SE(\bar{Y})\) [Equation (3.9)].
2. Compute the t-statistic [Equation (3.13)].
3. Compute the p-value [Equation (3.14)]. Reject the hypothesis at the 5%
significance level if the p-value is less than 0.05 (equivalently, if \(|t^{act}| > 1.96\)).
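
The two-sided critical values quoted in this section (1.96 at the 5% level and 2.81 at the 0.5% level) can be recovered from the inverse of the standard normal distribution function. The following is a sketch using Python's standard library; the function name `two_sided_critical_value` is an assumption made for illustration.

```python
from statistics import NormalDist


def two_sided_critical_value(significance_level):
    """Value c such that a standard normal variable exceeds c in absolute value
    with probability equal to the significance level."""
    return NormalDist().inv_cdf(1 - significance_level / 2)


for level in (0.10, 0.05, 0.01, 0.005):
    c = two_sided_critical_value(level)
    print(f"significance level {level:.3f}: reject if |t| > {c:.2f}")
```

Lowering the significance level from 5% to 0.5% raises the threshold that the absolute t-statistic must exceed, which is why fewer results are declared significant under the stricter standard.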

The choice of significance level requires judgment and depends on the applica- 
tion. In some economic applications, a false positive might be less of a problem than 
in a medical context, where the false positive could lead to patients receiving ineffec- 
tive treatments. In such cases, a 5% significance level could be appropriate. 

Whatever the significance level, it is important to keep in mind that p-values are 
designed for tests of a null hypothesis, so they, like t-statistics, are useful only when
the null hypothesis itself is of interest. This section uses the example of earnings. 
Even though many interns are unpaid, nobody thinks that, on average, workers earn 
nothing at all, so the null hypothesis that earnings are zero is economically uninter- 
esting and not worth testing. In contrast, the null hypothesis that the mean earnings 
of men and of women are the same is interesting and of societal importance, and that 
null hypothesis is examined in Section 3.4. 

Key Concept 3.6 summarizes hypothesis tests for the population mean against 
the two-sided alternative. 


One-Sided Alternatives 


In some circumstances, the alternative hypothesis might be that the mean exceeds \(\mu_{Y,0}\).
For example, one hopes that education helps in the labor market, so the relevant
alternative to the null hypothesis that earnings are the same for college graduates and
noncollege graduates is not just that their earnings differ, but rather that graduates earn more
than nongraduates. This is called a one-sided alternative hypothesis and can be written

\[ H_1: E(Y) > \mu_{Y,0} \quad \text{(one-sided alternative)}. \qquad (3.16) \]

The general approach to computing p-values and to hypothesis testing is the same for
one-sided alternatives as it is for two-sided alternatives, with the modification that only
large positive values of the t-statistic reject the null hypothesis rather
than values that are large in absolute value. Specifically, to test the one-sided hypothesis
in Equation (3.16), construct the t-statistic in Equation (3.13). The p-value is the area
under the standard normal distribution to the right of the calculated t-statistic. That is,
the p-value, based on the N(0, 1) approximation to the distribution of the t-statistic, is

\[ \text{p-value} = \Pr_{H_0}(Z > t^{act}) = 1 - \Phi(t^{act}). \qquad (3.17) \]


The N(0, 1) critical value for a one-sided test with a 5% significance level is 1.64. The 
rejection region for this test is all values of the t-statistic exceeding 1.64. 

The one-sided hypothesis in Equation (3.16) concerns values of \(\mu_Y\) exceeding
\(\mu_{Y,0}\). If instead the alternative hypothesis is that \(E(Y) < \mu_{Y,0}\), then the discussion
of the previous paragraph applies except that the signs are switched; for example, the
5% rejection region consists of values of the t-statistic less than -1.64.
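
A short sketch of the one-sided calculation follows, covering both directions of the alternative. The function name `one_sided_p_value` and its `alternative` argument are hypothetical, and the t-statistic of 2.06 is borrowed from the earlier wage example purely for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, N(0, 1)


def one_sided_p_value(t_act, alternative="greater"):
    """Large-sample one-sided p-value for a computed t-statistic."""
    if alternative == "greater":      # H1: E(Y) > mu_{Y,0}, Equation (3.17)
        return 1 - STD_NORMAL.cdf(t_act)
    if alternative == "less":         # H1: E(Y) < mu_{Y,0}, signs switched
        return STD_NORMAL.cdf(t_act)
    raise ValueError("alternative must be 'greater' or 'less'")


print(f"{one_sided_p_value(2.06):.4f}")            # right-tail area beyond 2.06
print(f"{one_sided_p_value(-2.06, 'less'):.4f}")   # left-tail area below -2.06
```

For a positive t-statistic, the one-sided p-value against the "greater" alternative is half the two-sided p-value, since only one tail of the distribution counts as evidence against the null.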


3.3 Confidence Intervals
for the Population Mean


Because of random sampling error, it is impossible to learn the exact value of the
population mean of Y using only the information in a sample. However, it is possible
to use data from a random sample to construct a set of values that contains the true
population mean \(\mu_Y\) with a certain prespecified probability. Such a set is called a
confidence set, and the prespecified probability that \(\mu_Y\) is contained in this set is
called the confidence level. The confidence set for \(\mu_Y\) turns out to be all the possible
values of the mean between a lower and an upper limit, so that the confidence set is
an interval, called a confidence interval.

Here is one way to construct a 95% confidence set for the population mean.
Begin by picking some arbitrary value for the mean; call it \(\mu_{Y,0}\). Test the null
hypothesis that \(\mu_Y = \mu_{Y,0}\) against the alternative that \(\mu_Y \neq \mu_{Y,0}\) by computing the t-statistic;
if its absolute value is less than 1.96, this hypothesized value \(\mu_{Y,0}\) is not rejected at the
5% level, so write down this nonrejected value \(\mu_{Y,0}\). Now pick another arbitrary value
of \(\mu_{Y,0}\) and test it; if you cannot reject it, write down this value on your list. Do this
again and again; indeed, do so for all possible values of the population mean.
Continuing this process yields the set of all values of the population mean that cannot be
rejected at the 5% level by a two-sided hypothesis test.

This list is useful because it summarizes the set of hypotheses you can and cannot 
reject (at the 5% level) based on your data: If someone walks up to you with a spe- 
cific number in mind, you can tell him whether his hypothesis is rejected or not 
simply by looking up his number on your handy list. A bit of clever reasoning shows 
that this set of values has a remarkable property: The probability that it contains the 
true value of the population mean is 95%. 

KEY CONCEPT 3.7
Confidence Intervals for the Population Mean

A 95% two-sided confidence interval for \(\mu_Y\) is an interval constructed so that it
contains the true value of \(\mu_Y\) in 95% of all possible random samples. When the
sample size n is large, 90%, 95%, and 99% confidence intervals for \(\mu_Y\) are:

90% confidence interval for \(\mu_Y\) = \(\{\bar{Y} \pm 1.64\,SE(\bar{Y})\}\),
95% confidence interval for \(\mu_Y\) = \(\{\bar{Y} \pm 1.96\,SE(\bar{Y})\}\), and
99% confidence interval for \(\mu_Y\) = \(\{\bar{Y} \pm 2.58\,SE(\bar{Y})\}\).


The clever reasoning goes like this: Suppose the true value of \(\mu_Y\) is 21.5 (although
we do not know this). Then \(\bar{Y}\) has a normal distribution centered on 21.5, and the
t-statistic testing the null hypothesis \(\mu_Y = 21.5\) has a N(0, 1) distribution. Thus, if n is
large, the probability of rejecting the null hypothesis \(\mu_Y = 21.5\) at the 5% level is 5%.
But because you tested all possible values of the population mean in constructing your
set, in particular you tested the true value, \(\mu_Y = 21.5\). In 95% of all samples, you will
correctly accept 21.5; this means that in 95% of all samples, your list will contain the
true value of \(\mu_Y\). Thus the values on your list constitute a 95% confidence set for \(\mu_Y\).

This method of constructing a confidence set is impractical, for it requires you to
test all possible values of \(\mu_Y\) as null hypotheses. Fortunately, there is a much easier
approach. According to the formula for the t-statistic in Equation (3.13), a trial value
of \(\mu_Y\) is rejected at the 5% level if it is more than 1.96 standard errors away from \(\bar{Y}\).
Thus the set of values of \(\mu_Y\) that are not rejected at the 5% level consists of those
values within \(\pm 1.96\,SE(\bar{Y})\) of \(\bar{Y}\); that is, a 95% confidence interval for \(\mu_Y\) is
\(\bar{Y} - 1.96\,SE(\bar{Y}) \le \mu_Y \le \bar{Y} + 1.96\,SE(\bar{Y})\). Key Concept 3.7 summarizes this
approach.
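
The equivalence between the list of nonrejected values and the interval formula can be illustrated numerically. The sketch below tests candidate means on a one-cent grid between $15 and $30 (an arbitrary range chosen for illustration), using the sample average of 22.64 and standard error of 1.28 from the hypothetical wage example; the function name `not_rejected` is an assumption.

```python
def not_rejected(mu_0, ybar, se, critical_value=1.96):
    """True if the two-sided test of the hypothesized mean mu_0 does not reject at the 5% level."""
    return abs((ybar - mu_0) / se) <= critical_value


ybar, se = 22.64, 1.28   # values from the hypothetical wage example

# Test every candidate mean on a one-cent grid and keep those that are not rejected.
candidates = [cents / 100 for cents in range(1500, 3001)]
confidence_set = [mu_0 for mu_0 in candidates if not_rejected(mu_0, ybar, se)]

# Direct formula: sample average plus or minus 1.96 standard errors.
lower_formula, upper_formula = ybar - 1.96 * se, ybar + 1.96 * se

print(min(confidence_set), max(confidence_set))
print(round(lower_formula, 4), round(upper_formula, 4))
```

Every value kept by the grid search lies between the two formula endpoints, and the smallest and largest kept values sit within one grid step of those endpoints, so the list of nonrejected values is just the interval in discretized form.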

As an example, consider the problem of constructing a 95% confidence interval
for the mean hourly earnings of recent college graduates using a hypothetical
random sample of 200 recent college graduates where \(\bar{Y}\) = $22.64 and
\(SE(\bar{Y}) = 1.28\). The 95% confidence interval for mean hourly earnings is
\(22.64 \pm 1.96 \times 1.28 = 22.64 \pm 2.51\) = ($20.13, $25.15).
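
The same calculation can be written as a small function that works for any confidence level. This is a sketch with a hypothetical function name, `confidence_interval`; it takes the critical value from the standard normal distribution rather than from a rounded table entry.

```python
from statistics import NormalDist


def confidence_interval(ybar, se, confidence_level=0.95):
    """Large-sample two-sided confidence interval for the population mean."""
    # Critical value: 1.64, 1.96, or 2.58 (rounded) for 90%, 95%, or 99%.
    c = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return ybar - c * se, ybar + c * se


# Sample average and standard error from the hypothetical wage example.
for level in (0.90, 0.95, 0.99):
    lower, upper = confidence_interval(22.64, 1.28, level)
    print(f"{level:.0%} confidence interval: ({lower:.2f}, {upper:.2f})")
```

A higher confidence level requires a larger critical value and therefore a wider interval.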

This discussion so far has focused on two-sided confidence intervals. One could
instead construct a one-sided confidence interval as the set of values of \(\mu_Y\) that
cannot be rejected by a one-sided hypothesis test. Although one-sided confidence
intervals have applications in some branches of statistics, they are uncommon in applied
econometric analysis.


Coverage probabilities. The coverage probability of a confidence interval for the 
population mean is the probability, computed over all possible random samples, that 
it contains the true population mean. 
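
Coverage can be approximated by simulation: draw many samples from a population with a known mean, build the interval from each, and count how often the interval contains the true mean. The sketch below is illustrative only. The exponential population, the sample size of 200, the 2,000 replications, and the function name `coverage_rate` are all assumptions chosen to show a skewed, non-normal case.

```python
import random
from math import sqrt


def coverage_rate(true_mean=1.0, n=200, replications=2000, seed=0):
    """Fraction of simulated samples whose 95% interval contains the true mean."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(replications):
        # Exponential draws: a skewed population whose mean is true_mean.
        sample = [rng.expovariate(1.0 / true_mean) for _ in range(n)]
        ybar = sum(sample) / n
        s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)   # sample variance
        se = sqrt(s2 / n)                                     # standard error of the average
        if abs(ybar - true_mean) <= 1.96 * se:                # interval contains the true mean
            covered += 1
    return covered / replications


rate = coverage_rate()
print(f"Share of intervals containing the true mean: {rate:.3f}")
```

With a large sample size the simulated share should be close to the nominal 95%, though it will not match it exactly because of simulation noise and because the normal approximation is only approximate.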
3.4 Comparing Means from Different
Populations 


Do recent male and female college graduates earn the same amount on average? 
Answering this question involves comparing the means of two different population 
distributions. This section summarizes how to test hypotheses and how to construct 
confidence intervals for the difference in the means from two different populations. 


Hypothesis Tests for the Difference Between 
Two Means 


To illustrate a test for the difference between two means, let \(\mu_w\) be the mean hourly
earnings in the population of women recently graduated from college, and let \(\mu_m\) be
the population mean for recently graduated men. Consider the null hypothesis that
mean earnings for these two populations differ by a certain amount, say, \(d_0\). Then the
null hypothesis and the two-sided alternative hypothesis are

\[ H_0: \mu_m - \mu_w = d_0 \quad \text{vs.} \quad H_1: \mu_m - \mu_w \neq d_0. \qquad (3.18) \]

The null hypothesis that men and women in these populations have the same mean
earnings corresponds to \(H_0\) in Equation (3.18) with \(d_0 = 0\).

Because these population means are unknown, they must be estimated from
samples of men and women. Suppose we have samples of \(n_m\) men and \(n_w\) women
drawn at random from their populations. Let the sample average hourly earnings be
\(\bar{Y}_m\) for men and \(\bar{Y}_w\) for women. Then an estimator of \(\mu_m - \mu_w\) is \(\bar{Y}_m - \bar{Y}_w\).

To test the null hypothesis that \(\mu_m - \mu_w = d_0\) using \(\bar{Y}_m - \bar{Y}_w\), we need to know
the sampling distribution of \(\bar{Y}_m - \bar{Y}_w\). Recall that \(\bar{Y}_m\) is, according to the central limit
theorem, approximately distributed \(N(\mu_m, \sigma_m^2/n_m)\), where \(\sigma_m^2\) is the population
variance of earnings for men. Similarly, \(\bar{Y}_w\) is approximately distributed
\(N(\mu_w, \sigma_w^2/n_w)\), where \(\sigma_w^2\) is the population variance of earnings for women. Also,
recall from Section 2.4 that a weighted average of two normal random variables is
itself normally distributed. Because \(\bar{Y}_m\) and \(\bar{Y}_w\) are constructed from different
randomly selected samples, they are independent random variables. Thus \(\bar{Y}_m - \bar{Y}_w\) is
distributed \(N[\mu_m - \mu_w, (\sigma_m^2/n_m) + (\sigma_w^2/n_w)]\).

If \(\sigma_m^2\) and \(\sigma_w^2\) are known, then this approximate normal distribution can be used
to compute p-values for the test of the null hypothesis that \(\mu_m - \mu_w = d_0\). In
practice, however, these population variances are typically unknown, so they must be
estimated. As before, they can be estimated using the sample variances, \(s_m^2\) and \(s_w^2\),
where \(s_m^2\) is defined as in Equation (3.7), except that the statistic is computed only for
the men in the sample, and \(s_w^2\) is defined similarly for the women. Thus the standard
error of \(\bar{Y}_m - \bar{Y}_w\) is

\[ SE(\bar{Y}_m - \bar{Y}_w) = \sqrt{\frac{s_m^2}{n_m} + \frac{s_w^2}{n_w}}. \qquad (3.19) \]

For a simplified version of Equation (3.19) when Y is a Bernoulli random variable,
see Exercise 3.15.

The t-statistic for testing the null hypothesis is constructed analogously to the
t-statistic for testing a hypothesis about a single population mean, by subtracting
the null hypothesized value of \(\mu_m - \mu_w\) from the estimator \(\bar{Y}_m - \bar{Y}_w\) and dividing the
result by the standard error of \(\bar{Y}_m - \bar{Y}_w\):

\[ t = \frac{(\bar{Y}_m - \bar{Y}_w) - d_0}{SE(\bar{Y}_m - \bar{Y}_w)} \quad \text{(t-statistic for comparing two means)}. \qquad (3.20) \]


If both \(n_m\) and \(n_w\) are large, then this t-statistic has a standard normal distribution
when the null hypothesis is true.

Because the t-statistic in Equation (3.20) has a standard normal distribution
under the null hypothesis when \(n_m\) and \(n_w\) are large, the p-value of the two-sided test
is computed exactly as it was in the case of a single population. That is, the p-value is
computed using Equation (3.14).

To conduct a test with a prespecified significance level, simply calculate the 
t-statistic in Equation (3.20), and compare it to the appropriate critical value. For 
example, the null hypothesis is rejected at the 5% significance level if the absolute 
value of the t-statistic exceeds 1.96. 

If the alternative is one-sided rather than two-sided (that is, if the alternative is that
\(\mu_m - \mu_w > d_0\)), then the test is modified as outlined in Section 3.2. The p-value is
computed using Equation (3.17), and a test with a 5% significance level rejects when t > 1.64.


Confidence Intervals for the Difference Between 
Two Population Means 


The method for constructing confidence intervals summarized in Section 3.3 extends
to constructing a confidence interval for the difference between the means,
\(d = \mu_m - \mu_w\). Because the hypothesized value \(d_0\) is rejected at the 5% level if
\(|t| > 1.96\), \(d_0\) will be in the confidence set if \(|t| \le 1.96\). But \(|t| \le 1.96\) means that
the estimated difference, \(\bar{Y}_m - \bar{Y}_w\), is less than 1.96 standard errors away from \(d_0\).
Thus the 95% two-sided confidence interval for d consists of those values of
d within \(\pm 1.96\) standard errors of \(\bar{Y}_m - \bar{Y}_w\):

\[ \text{95\% confidence interval for } d = \mu_m - \mu_w \text{ is } (\bar{Y}_m - \bar{Y}_w) \pm 1.96\,SE(\bar{Y}_m - \bar{Y}_w). \qquad (3.21) \]

With these formulas in hand, the box “Social Class or Education? Childhood Circumstances
and Adult Earnings Revisited” contains an empirical investigation of differences
in earnings of different households in the United Kingdom.
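
Equations (3.19) through (3.21) can be combined into one small routine. The sketch below uses a hypothetical function name, `compare_means`, and takes its inputs from the "no qualifications" row of Table 3.1 in the box (group 1 is the "higher" category and group 2 the "routine" category).

```python
from math import sqrt
from statistics import NormalDist


def compare_means(ybar_1, s_1, n_1, ybar_2, s_2, n_2, d_0=0.0):
    """Large-sample test and 95% confidence interval for a difference in means."""
    diff = ybar_1 - ybar_2
    se = sqrt(s_1 ** 2 / n_1 + s_2 ** 2 / n_2)       # standard error, Equation (3.19)
    t = (diff - d_0) / se                            # t-statistic, Equation (3.20)
    p_value = 2 * NormalDist().cdf(-abs(t))          # two-sided p-value, Equation (3.14)
    interval = (diff - 1.96 * se, diff + 1.96 * se)  # confidence interval, Equation (3.21)
    return diff, se, t, p_value, interval


# "No qualifications" row of Table 3.1: higher versus routine occupational category.
diff, se, t, p_value, interval = compare_means(2223.13, 2115.12, 1129,
                                               1842.98, 1487.29, 6383)
print(f"difference = {diff:.2f}, SE = {se:.2f}, t = {t:.2f}")
print(f"95% confidence interval: ({interval[0]:.2f}, {interval[1]:.2f})")
```

The difference and standard error agree with the table. The interval endpoints computed with the critical value 1.96 may differ from the table's entries in the last digits, which could reflect rounding or a slightly different critical value in the published figures.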
Differences-of-Means Estimation of Causal
Effects Using Experimental Data 


Recall from Section 1.2 that a randomized controlled experiment randomly selects 
subjects (individuals or, more generally, entities) from a population of interest, then 
randomly assigns them either to a treatment group, which receives the experimental 
treatment, or to a control group, which does not receive the treatment. The difference 
between the sample means of the treatment and control groups is an estimator of the 
causal effect of the treatment. 


The Causal Effect as a Difference of Conditional 
Expectations 


The causal effect of a treatment is the expected effect on the outcome of interest of
the treatment as measured in an ideal randomized controlled experiment. This effect
can be expressed as the difference of two conditional expectations. Specifically, the
causal effect on Y of treatment level x is the difference in the conditional
expectations, \(E(Y \mid X = x) - E(Y \mid X = 0)\), where \(E(Y \mid X = x)\) is the expected value of Y
for the treatment group (which receives treatment level \(X = x\)) in an ideal
randomized controlled experiment and \(E(Y \mid X = 0)\) is the expected value of Y for the
control group (which receives treatment level \(X = 0\)). In the context of experiments,
the causal effect is also called the treatment effect. If there are only two treatment
levels (that is, if the treatment is binary), then we can let \(X = 0\) denote the control
group and \(X = 1\) denote the treatment group. If the treatment is binary, then the
causal effect (that is, the treatment effect) is \(E(Y \mid X = 1) - E(Y \mid X = 0)\) in an
ideal randomized controlled experiment.


Estimation of the Causal Effect Using 
Differences of Means 


If the treatment in a randomized controlled experiment is binary, then the causal 
effect can be estimated by the difference in the sample average outcomes between 
the treatment and control groups. The hypothesis that the treatment is ineffective is 
equivalent to the hypothesis that the two means are the same, which can be tested 
using the t-statistic for comparing two means, given in Equation (3.20). A 95% con- 
fidence interval for the difference in the means of the two groups is a 95% confidence 
interval for the causal effect, so a 95% confidence interval for the causal effect can 
be constructed using Equation (3.21). 
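
To see the differences-of-means estimator at work, the following sketch simulates a binary-treatment experiment in which the true causal effect is known by construction. Everything about the setup is a made-up illustration: the outcome distribution, the true effect of 2.0, the sample size, and the function names `sample_variance` and `simulate_experiment`.

```python
import random
from math import sqrt
from statistics import fmean


def sample_variance(values):
    """Sample variance with the n - 1 divisor, as in Equation (3.7)."""
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def simulate_experiment(n=10_000, true_effect=2.0, seed=0):
    """Simulate random assignment and estimate the treatment effect."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        untreated_outcome = rng.gauss(10.0, 1.0)   # hypothetical outcome without treatment
        if rng.random() < 0.5:                     # coin-flip assignment to treatment
            treated.append(untreated_outcome + true_effect)
        else:
            control.append(untreated_outcome)
    estimate = fmean(treated) - fmean(control)     # difference in sample means
    se = sqrt(sample_variance(treated) / len(treated)
              + sample_variance(control) / len(control))   # Equation (3.19)
    return estimate, se, (estimate - 1.96 * se, estimate + 1.96 * se)


estimate, se, interval = simulate_experiment()
print(f"estimated effect = {estimate:.3f}, SE = {se:.3f}")
print(f"95% confidence interval: ({interval[0]:.3f}, {interval[1]:.3f})")
```

Because assignment is random, the treatment and control groups have the same distribution of untreated outcomes, so the difference in sample means should be close to the true effect when the groups are large.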

A well-designed, well-run experiment can provide a compelling estimate of a 
causal effect. For this reason, randomized controlled experiments are commonly con- 
ducted in some fields, such as medicine. In economics, however, experiments tend to 
be expensive, difficult to administer, and, in some cases, ethically questionable, so 
they are used less often. For this reason, econometricians sometimes study “natural 

Social Class or Education? Childhood Circumstances and Adult Earnings Revisited

The box in Chapter 2 “The Distribution of Adulthood Earnings in the United Kingdom
by Childhood Socioeconomic Circumstances” suggests that when an individual’s father has
a “routine” occupation, the individual, as an adult, goes on to live in a household with
lower average income.

Are there other factors that affect household income? Yes,
it is possible that there are relevant intermediate 
factors like education. It is generally hypothesized 
and observed that more education is associated 
with greater income, which will allow individuals to 
increase their contribution to household income. 

Table 3.1 breaks down the differences in mean household income for individuals
according to their father’s NS-SEC occupation type, and considers these differences
for selected highest levels of educational qualification. These categories include those
with no qualifications, those whose highest qualification level is GCSE (exams generally
taken at age 16), those whose highest educational qualification is A-Level (exams
generally taken at age 18), and those with an undergraduate degree or higher. For
simplicity, only individuals whose father’s NS-SEC occupational category was either the
highest (“higher”) or the lowest (“routine”) are included in this analysis.

The data show that, as expected, within both groups according to the NS-SEC of a
father’s occupation, those with higher qualifications are part of households with
higher total income. The income gap between those with qualifications of at least
one degree and those with no qualifications stands at £1,467.38 where the father’s
NS-SEC category is higher, and at a comparable £1,527.98 where the father’s NS-SEC
category is routine.

It is interesting to note the differences between 
mean income by the father’s occupational categori- 
zation (Y, — Y,) for each of the educational group- 
ings. For instance, individuals with no qualifications 
whose father’s NS-SEC job categorization was 
higher are part of households with a mean income 
of £2223.13 while for the classification routine 
this value stood at £1842.98. This implies a differ- 
ence in means of £380.15, with a standard error of 
V/2115.127/1129 + 1487.292/6383 = £65.64 with 


TABLE 3.1 Differences in Household Income According to Childhood Socioeconomic
Circumstances, Grouped by Level of Highest Qualification

                              Father’s NS-SEC = Higher        Father’s NS-SEC = Routine       Difference, Higher vs. Routine
Qualification                 $\bar{Y}_h$  $s_h$      $n_h$   $\bar{Y}_r$  $s_r$      $n_r$   $\bar{Y}_h - \bar{Y}_r$  $SE(\bar{Y}_h - \bar{Y}_r)$  95% Confidence Interval for $d$
None                          £2,223.13    £2,115.12  1129    £1,842.98    £1,487.29  6383    £380.15                  £65.64                       (£251.38, £508.93)
GCSE/O-Level                  £2,837.18    £1,819.73  1962    £2,596.93    £1,738.47  4042    £240.25                  £49.35                       (£143.49, £337.00)
A-Level                       £3,045.99    £2,451.81  1216    £2,745.70    £1,912.50  1169    £300.30                  £89.85                       (£124.11, £476.49)
Undergraduate degree or more  £3,690.51    £2,743.55  4359    £3,370.96    £2,443.58  2505    £319.55                  £64.11                       (£193.86, £445.23)
All categories                £3,215.71    £2,497.73  8666    £2,405.45    £1,886.86  14099   £810.25                  £31.18                       (£749.13, £871.38)

Source: Understanding Society.


a 95% confidence interval of (£251.38, £508.93). It is
worth noting that the difference in income, pooling these
educational categories together, between those
whose father’s NS-SEC categorization is “higher” and
those for whom it is “routine” is £810.25.
The results in the table suggest a difference in composition
by educational attainment of these groupings
according to the father’s NS-SEC category. When
broken down in this way, however, the estimated difference
for every qualification level is substantially
lower than £810.25. All of these estimated differences
are significantly different from zero.
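
The arithmetic behind the “None” row of Table 3.1 can be retraced with a short Python sketch. It uses only the summary statistics reported in the table and the large-sample critical value 1.96; the variable names are illustrative and not part of the original analysis.

```python
import math

# Summary statistics for the "None" row of Table 3.1.
mean_h, sd_h, n_h = 2223.13, 2115.12, 1129   # father's NS-SEC = higher
mean_r, sd_r, n_r = 1842.98, 1487.29, 6383   # father's NS-SEC = routine

diff = mean_h - mean_r
# Standard error of the difference, allowing different group variances.
se = math.sqrt(sd_h**2 / n_h + sd_r**2 / n_r)
t_stat = diff / se

# Large-sample 95% interval using the standard normal critical value 1.96.
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"difference = {diff:.2f}, SE = {se:.2f}, t = {t_stat:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

With 1.96 the interval comes out marginally narrower than the one printed in the table. That would be consistent with the table having been produced with a slightly larger critical value, although the source does not say which one was used.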

This empirical analysis suggests that levels
of education do play some part in explaining the
difference in household income according to the
socioeconomic status of the father. However, does
this analysis tell us the full story? Are individuals
with higher levels of education likely to be in
households with more than one earner? Does the
difference in household income arise from an individual’s
own contribution to household income or,
if the individual is cohabiting, also from her or his
partner’s contribution to household income? Is this
relationship affected by changing patterns of educational
attainment that are correlated with age?
We will examine questions such as these further
once we have introduced the basics of multivariate
regression in later chapters.
experiments,” also called quasi-experiments, in which some event unrelated to the
treatment or subject characteristics has the effect of assigning different treatments to 
different subjects as if they had been part of a randomized controlled experiment. 
The box “A Way to Increase Voter Turnout” provides an example of such a quasi- 
experiment that yielded some surprising conclusions. 


3.6 Using the t-Statistic When the Sample Size Is Small


In Sections 3.2 through 3.5, the t-statistic is used in conjunction with critical values 
from the standard normal distribution for hypothesis testing and for the construction 
of confidence intervals. The use of the standard normal distribution is justified by the 
central limit theorem, which applies when the sample size is large. When the sample 
size is small, the standard normal distribution can provide a poor approximation to 
the distribution of the t-statistic. If, however, the population distribution is itself
normally distributed, then the exact distribution (that is, the finite-sample distribution;
see Section 2.6) of the t-statistic testing the mean of a single population is the Student
t distribution with $n - 1$ degrees of freedom, and critical values can be taken from
the Student t distribution.
A Way to Increase Voter Turnout

Apathy among citizens toward political participation,
especially in voting, has been noted in
the United Kingdom and other democratic countries.
This kind of behavior is generally seen in economies
where people have greater mobility, maintain
an intensive work culture, and work for private
corporate entities. Apart from these, there could
be other dominant factors that have had a negative
impact on the citizens’ willingness to participate in
elections, such as politicians failing to keep their promises
or inappropriately using public funds.

In 2005, during the campaign period before the 
general election, a study was conducted in a Man- 
chester constituency in the United Kingdom. The 
constituency’s voter turnout rate in the 2001 general 
election had been 48.6%, while the national average 
had been 59.4%. Thus, voter participation in this con- 
stituency was far below the national average. For the 
experiment, three groups (two treatment groups and 
one control group) were randomly selected out of the 
registered voters from whom landline numbers could 
be obtained. One of the treatment groups was exposed 
to strong canvassing in the form of telephone calls, and 
the other treatment group was exposed to strong can- 
vassing in the form of door-to-door visits. No contacts 
were made with the control group. The callers and 
the door-to-door canvassers were given instructions 
to ask respondents three questions, namely, whether 
the respondents thought voting is important, whether 
the respondents intended to vote, and whether they 
would vote by post. The conversations were informal,
and the main objective of this exercise was to persuade
citizens to vote by focusing on the importance
of voting. The callers and canvassers were also advised
to respond to any concerns of the voters regarding the
voting process.

The researchers obtained interesting results from the
election. The participation rate was 55.1% in the
group exposed to door-to-door canvassing and 55%
in the group that received telephone calls. Both
rates differed from the rate in the control group, which
received no contact. Further calculations using suitable
methodologies gave estimates of the effects of
canvassing and telephone calls of 6.7% and 7.3%, respectively.
The overall experiment was a success, as the two
interventions carried out on the two treatment groups
by a non-partisan source had impacts that were
statistically significant.

This exercise illustrated that citizens can be 
nudged to participate in elections by creating 
awareness through personal contacts. In yet another 
democracy, India, the 2014 general election saw a 
record voter turnout. A top Election Commission
official said that the Election Commission’s
efforts to increase voters’ awareness and registration
helped the process.


Sources: 1. Alice Moseley, Corinne Wales, Gerry Stoker,
Graham Smith, Liz Richardson, Peter John, and Sarah Cotterill,
“Nudge, Nudge, Think, Think: Experimenting with
Ways to Change Civic Behaviour,” Bloomsbury Academic,
March 2013. 2. “Lok Sabha Polls 2014: Country Records
Highest Voter Turnout since Independence,” The Economic
Times, May 13, 2014.
The t-Statistic and the Student t Distribution


The t-statistic testing the mean. Consider the t-statistic used to test the hypothesis
that the mean of $Y$ is $\mu_{Y,0}$, using data $Y_1, \ldots, Y_n$. The formula for this statistic is given
by Equation (3.10), where the standard error of $\bar{Y}$ is given by Equation (3.8). Substitution
of the latter expression into the former yields the formula for the t-statistic:

$$t = \frac{\bar{Y} - \mu_{Y,0}}{\sqrt{s_Y^2/n}}, \tag{3.22}$$

where $s_Y^2$ is given in Equation (3.7).

As discussed in Section 3.2, under general conditions the t-statistic has a standard 
normal distribution if the sample size is large and the null hypothesis is true [see 
Equation (3.12)]. Although the standard normal approximation to the t-statistic is 
reliable for a wide range of distributions of Y if n is large, it can be unreliable if n is 
small. The exact distribution of the t-statistic depends on the distribution of Y, and it 
can be very complicated. There is, however, one special case in which the exact distribution
of the t-statistic is relatively simple: If $Y_1, \ldots, Y_n$ are i.i.d. draws from a
normal distribution, then the t-statistic in Equation (3.22) has a Student t distribution
with $n - 1$ degrees of freedom. (The mathematics behind this result is provided in
Sections 18.4 and 19.4.)

If the population distribution is normally distributed, then critical values from
the Student t distribution can be used to perform hypothesis tests and to construct
confidence intervals. As an example, consider a hypothetical problem in which
$t^{act} = 2.15$ and $n = 8$, so that the degrees of freedom is $n - 1 = 7$. From Appendix
Table 2, the 5% two-sided critical value for the $t_7$ distribution is 2.36. Because the
t-statistic is smaller in absolute value than the critical value (2.15 < 2.36), the null
hypothesis would not be rejected at the 5% significance level against the two-sided
alternative. The 95% confidence interval for $\mu_Y$, constructed using the $t_7$ distribution,
would be $\bar{Y} \pm 2.36\,SE(\bar{Y})$. This confidence interval is wider than the confidence
interval constructed using the standard normal critical value of 1.96.
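
The comparison in this example can be reproduced numerically. The sketch below is one possible way to do so with only the Python standard library: it integrates the Student t density with Simpson's rule and finds the critical value by bisection. The helper names (`t_pdf`, `t_cdf`, `t_critical`) are illustrative assumptions, not functions from any statistical package, and in practice a statistics library would normally be used instead.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """Pr(t <= x), integrating the density over [0, |x|] with Simpson's rule."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(df, level=0.05):
    """Two-sided critical value c solving Pr(|t| > c) = level, by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - level / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, t_act = 8, 2.15
crit_t = t_critical(n - 1)               # Student t with 7 degrees of freedom
crit_z = NormalDist().inv_cdf(0.975)     # standard normal
p_t = 2 * (1 - t_cdf(abs(t_act), n - 1))
p_z = 2 * (1 - NormalDist().cdf(abs(t_act)))

print(f"critical values: t_7 = {crit_t:.2f}, normal = {crit_z:.2f}")
print(f"two-sided p-values: t_7 = {p_t:.3f}, normal = {p_z:.3f}")
```

With $n = 8$ the two approaches disagree: the statistic 2.15 falls below the $t_7$ critical value but above the normal critical value, so the choice of distribution changes the test decision in a sample this small.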


The t-statistic testing differences of means. The t-statistic testing the difference of
two means, given in Equation (3.20), does not have a Student t distribution, even if
the population distribution of $Y$ is normal. (The Student t distribution does not apply
here because the variance estimator used to compute the standard error in
Equation (3.19) does not produce a denominator in the t-statistic with a chi-squared
distribution.)

A modified version of the differences-of-means t-statistic, based on a different
standard error formula (the “pooled” standard error formula), has an exact Student
t distribution when $Y$ is normally distributed; however, the pooled standard error
formula applies only in the special case that the two groups have the same variance
or that each group has the same number of observations (Exercise 3.21). Adopt the
notation of Equation (3.19) so that the two groups are denoted as $m$ and $w$. The
pooled variance estimator is

$$s_{pooled}^2 = \frac{1}{n_m + n_w - 2}\left[\sum_{\substack{i=1 \\ \text{group } m}}^{n_m} (Y_i - \bar{Y}_m)^2 + \sum_{\substack{i=1 \\ \text{group } w}}^{n_w} (Y_i - \bar{Y}_w)^2\right], \tag{3.23}$$

where the first summation is for the observations in group $m$ and the second summation
is for the observations in group $w$. The pooled standard error of the difference
in means is $SE_{pooled}(\bar{Y}_m - \bar{Y}_w) = s_{pooled} \times \sqrt{1/n_m + 1/n_w}$, and the pooled
t-statistic is computed using Equation (3.20), where the standard error is the pooled
standard error, $SE_{pooled}(\bar{Y}_m - \bar{Y}_w)$.

If the population distribution of $Y$ in group $m$ is $N(\mu_m, \sigma_m^2)$, if the population
distribution of $Y$ in group $w$ is $N(\mu_w, \sigma_w^2)$, and if the two group variances are the
same (that is, $\sigma_m^2 = \sigma_w^2$), then under the null hypothesis the t-statistic computed using
the pooled standard error has a Student t distribution with $n_m + n_w - 2$ degrees of
freedom.

The drawback of using the pooled variance estimator $s_{pooled}^2$ is that it applies only
if the two population variances are the same (assuming $n_m \neq n_w$). If the population
variances are different, the pooled variance estimator is biased and inconsistent. If
the population variances are different but the pooled variance formula is used, the
null distribution of the pooled t-statistic is not a Student t distribution, even if the
data are normally distributed; in fact, it does not even have a standard normal distribution
in large samples. Therefore, the pooled standard error and the pooled t-statistic
should not be used unless you have a good reason to believe that the population
variances are the same.
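
To make the contrast concrete, the following Python sketch computes both standard errors for two small groups. The data are made up for illustration, and the function names `se_unpooled` and `se_pooled` are hypothetical; the formulas follow the unequal-variance standard error described for Equation (3.19) and the pooled variance in Equation (3.23).

```python
import math
from statistics import mean, variance

def se_unpooled(ym, yw):
    """Standard error allowing different group variances, as in Equation (3.19)."""
    return math.sqrt(variance(ym) / len(ym) + variance(yw) / len(yw))

def se_pooled(ym, yw):
    """Pooled standard error built from the pooled variance in Equation (3.23)."""
    mbar, wbar = mean(ym), mean(yw)
    ss = sum((y - mbar) ** 2 for y in ym) + sum((y - wbar) ** 2 for y in yw)
    s2_pooled = ss / (len(ym) + len(yw) - 2)
    return math.sqrt(s2_pooled) * math.sqrt(1 / len(ym) + 1 / len(yw))

# Hypothetical data: group w is smaller and much more dispersed than group m.
group_m = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
group_w = [4.0, 16.0, 8.0, 13.0]

print("unequal sizes:", se_unpooled(group_m, group_w), se_pooled(group_m, group_w))

# With equal group sizes the two formulas give the same number.
print("equal sizes:  ", se_unpooled(group_m[:4], group_w), se_pooled(group_m[:4], group_w))
```

In this made-up sample the pooled formula gives a noticeably smaller standard error than the unequal-variance formula, because the smaller group is the more dispersed one. When the groups are the same size the two formulas agree, which is the special case noted in the text.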


Use of the Student t Distribution in Practice 


For the problem of testing the mean of Y, the Student t distribution is applicable if 
the underlying population distribution of Y is normal. For economic variables, 
however, normal distributions are the exception (for example, see the boxes in 
Chapter 2 “The Distribution of Adulthood Earnings in the United Kingdom” and 
“The Unpegging of the Swiss Franc”). Even if the data are not normally distributed, 
the normal approximation to the distribution of the t-statistic is valid if the sample size 
is large. Therefore, inferences (hypothesis tests and confidence intervals) about the
mean of a distribution should be based on the large-sample normal approximation.
When comparing two means, any economic reason for two groups having 
different means typically implies that the two groups also could have different vari- 
ances. Accordingly, the pooled standard error formula is inappropriate, and the cor- 
rect standard error formula, which allows for different group variances, is as given in 
Equation (3.19). Even if the population distributions are normal, the t-statistic computed
using the standard error formula in Equation (3.19) does not have a Student
t distribution. In practice, therefore, inferences about differences in means should be
based on Equation (3.19), used in conjunction with the large-sample standard normal
approximation.

Even though the Student t distribution is rarely applicable in economics, some
software uses the Student t distribution to compute p-values and confidence intervals.
In practice, this does not pose a problem because the difference between the Student
t distribution and the standard normal distribution is negligible if the sample size is
large. For $n > 15$, the difference in the p-values computed using the Student t and
standard normal distributions never exceeds 0.01; for $n > 80$, the difference never
exceeds 0.002. In most modern applications, and in all applications in this text, the
sample sizes are in the hundreds or thousands, large enough for the difference between
the Student t distribution and the standard normal distribution to be negligible.


3.7 Scatterplots, the Sample Covariance, and the Sample Correlation


What is the relationship between age and earnings? This question, like many others, 
relates one variable, X (age), to another, Y (earnings). This section reviews three 
ways to summarize the relationship between variables: the scatterplot, the sample 
covariance, and the sample correlation coefficient. 


Scatterplots 


A scatterplot is a plot of $n$ observations on $X_i$ and $Y_i$ in which each observation is
represented by the point $(X_i, Y_i)$. For example, Figure 3.2 is a scatterplot of age (X)
and hourly earnings (Y) for a sample of 200 managers in the information industry 
from the March 2016 CPS. Each dot in Figure 3.2 corresponds to an (X, Y) pair for 
one of the observations. For example, one of the workers in this sample is 45 years 
old and earns $49.15 per hour; this worker’s age and earnings are indicated by the 
highlighted dot in Figure 3.2. The scatterplot shows a positive relationship between 
age and earnings in this sample: Older workers tend to earn more than younger 
workers. This relationship is not exact, however, and earnings could not be predicted 
perfectly using only a person’s age. 


Sample Covariance and Correlation 


The covariance and correlation were introduced in Section 2.3 as two properties of 
the joint probability distribution of the random variables X and Y. Because the popu- 
lation distribution is unknown, in practice we do not know the population covariance 
or correlation. The population covariance and correlation can, however, be estimated 
by taking a random sample of n members of the population and collecting the data 
$(X_i, Y_i)$, $i = 1, \ldots, n$.
FIGURE 3.2 Scatterplot of Average Hourly Earnings vs. Age

Each point in the plot represents the age and average earnings of one of the 200 workers in the
sample. The highlighted dot corresponds to a 45-year-old worker who earns $49.15 per hour. The
data are for computer and information systems managers from the March 2016 CPS.


The sample covariance and correlation are estimators of the population covariance
and correlation. Like the estimators discussed previously in this chapter, they
are computed by replacing a population mean (the expectation) with a sample mean.
The sample covariance, denoted $s_{XY}$, is

$$s_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}). \tag{3.24}$$

Like the sample variance, the average in Equation (3.24) is computed by dividing by
$n - 1$ instead of $n$; here, too, this difference stems from using $\bar{X}$ and $\bar{Y}$ to estimate
the respective population means. When $n$ is large, it makes little difference whether
division is by $n$ or $n - 1$.
The sample correlation coefficient, or sample correlation, is denoted $r_{XY}$ and is
the ratio of the sample covariance to the sample standard deviations:

$$r_{XY} = \frac{s_{XY}}{s_X s_Y}. \tag{3.25}$$

The sample correlation measures the strength of the linear association between $X$
and $Y$ in a sample of $n$ observations. Like the population correlation, the sample correlation
is unit free and lies between −1 and 1: $|r_{XY}| \le 1$.

The sample correlation equals 1 if $X_i = Y_i$ for all $i$ and equals −1 if $X_i = -Y_i$ for
all $i$. More generally, the correlation is ±1 if the scatterplot is a straight line. If the
line slopes upward, then there is a positive relationship between $X$ and $Y$ and
the correlation is 1. If the line slopes down, then there is a negative relationship and
the correlation is −1. The closer the scatterplot is to a straight line, the closer the
correlation is to ±1. A high correlation coefficient does not necessarily mean that
the line has a steep slope; rather, it means that the points in the scatterplot fall very
close to a straight line.


Consistency of the sample covariance and correlation. Like the sample variance,
the sample covariance is consistent. That is,

$$s_{XY} \xrightarrow{p} \sigma_{XY}. \tag{3.26}$$

In other words, in large samples the sample covariance is close to the population
covariance with high probability.

The proof of the result in Equation (3.26) under the assumption that $(X_i, Y_i)$ are
i.i.d. and that $X_i$ and $Y_i$ have finite fourth moments is similar to the proof in Appendix 3.3
that the sample variance is consistent and is left as an exercise (Exercise 3.20).

Because the sample variance and sample covariance are consistent, the sample
correlation coefficient is consistent; that is, $r_{XY} \xrightarrow{p} \text{corr}(X_i, Y_i)$.


Example. As an example, consider the data on age and earnings in Figure 3.2. For
these 200 workers, the sample standard deviation of age is $s_A = 9.57$ years, and the
sample standard deviation of earnings is $s_E = \$19.93$ per hour. The sample covariance
between age and earnings is $s_{AE} = 91.51$ (the units are years × dollars per
hour, not readily interpretable). Thus the sample correlation coefficient is
$r_{AE} = 91.51/(9.57 \times 19.93) = 0.48$. The correlation of 0.48 means that there is a
positive relationship between age and earnings, but as is evident in the scatterplot,
this relationship is far from perfect.

To verify that the correlation does not depend on the units of measurement,
suppose that earnings had been reported in cents, in which case the sample
standard deviation of earnings is 1993¢ per hour and the covariance between age
and earnings is 9151 (units are years × cents per hour); then the correlation is
$9151/(9.57 \times 1993) = 0.48$, or 48%.
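
The calculation in this example, including the change of units, can be checked in a few lines of Python. The numbers are the summary statistics quoted above; the variable names are illustrative only.

```python
s_age = 9.57          # sample standard deviation of age, in years
s_earn = 19.93        # sample standard deviation of earnings, in dollars per hour
s_age_earn = 91.51    # sample covariance, in years x dollars per hour

r_dollars = s_age_earn / (s_age * s_earn)

# Re-express earnings in cents: the covariance and the standard deviation
# of earnings are both multiplied by 100, so the factor cancels in the ratio.
r_cents = (100 * s_age_earn) / (s_age * (100 * s_earn))

print(round(r_dollars, 2), round(r_cents, 2))
```

Both ratios round to 0.48, which is the sense in which the correlation is unit free.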

Figure 3.3 gives additional examples of scatterplots and correlation. Figure 3.3a 
shows a strong positive linear relationship between these variables, and the sample 
correlation is 0.9. 

Figure 3.3b shows a strong negative relationship with a sample correlation
of −0.8. Figure 3.3c shows a scatterplot with no evident relationship, and the sample
correlation is 0. Figure 3.3d shows a clear relationship: As X increases, Y initially
increases but then decreases. Despite this discernible relationship between X and Y,
the sample correlation is 0; the reason is that for these data small values of Y are
associated with both large and small values of X.

FIGURE 3.3 Scatterplots for Four Hypothetical Data Sets

This final example emphasizes an important point: The correlation coefficient is 
a measure of linear association. There is a relationship in Figure 3.3d, but it is not 
linear. 
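
A small Python sketch illustrates the point with made-up data. It computes the sample covariance and correlation with the $n - 1$ divisor of Equation (3.24) and applies them to three straight lines and to one hump-shaped relationship of the kind described for Figure 3.3d. The function names and data are assumptions chosen for illustration, not the data behind the figure.

```python
import math

def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor, as in Equation (3.24)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Sample correlation: covariance divided by the two standard deviations."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

x = [-3, -2, -1, 0, 1, 2, 3]            # hypothetical X values, symmetric about 0
y_steep = [10 * v + 1 for v in x]       # steep upward line
y_flat = [0.1 * v + 1 for v in x]       # shallow upward line
y_down = [-2 * v for v in x]            # downward line
y_hump = [9 - v * v for v in x]         # Y rises, then falls

for label, y in [("steep", y_steep), ("flat", y_flat),
                 ("down", y_down), ("hump", y_hump)]:
    print(label, round(sample_corr(x, y), 6))
```

The steep and shallow lines both have a correlation of 1, showing that the slope does not matter, while the hump-shaped data have a correlation of 0 even though Y is an exact function of X.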
Summary

1. The sample average, $\bar{Y}$, is an estimator of the population mean, $\mu_Y$. When
   $Y_1, \ldots, Y_n$ are i.i.d.,
   a. the sampling distribution of $\bar{Y}$ has mean $\mu_Y$ and variance $\sigma_{\bar{Y}}^2 = \sigma_Y^2/n$;
   b. $\bar{Y}$ is unbiased;
   c. by the law of large numbers, $\bar{Y}$ is consistent; and
   d. by the central limit theorem, $\bar{Y}$ has an approximately normal sampling
      distribution when the sample size is large.

2. The t-statistic is used to test the null hypothesis that the population mean takes
   on a particular value. If $n$ is large, the t-statistic has a standard normal sampling
   distribution when the null hypothesis is true.

3. The t-statistic can be used to calculate the p-value associated with the null
   hypothesis. The p-value is the probability of drawing a statistic at least as
   adverse to the null hypothesis as the one you actually computed in your sample,
   assuming the null hypothesis is correct. A small p-value is evidence that
   the null hypothesis is false.

4. A 95% confidence interval for $\mu_Y$ is an interval constructed so that it contains
   the true value of $\mu_Y$ in 95% of all possible samples.

5. Hypothesis tests and confidence intervals for the difference in the means of
   two populations are conceptually similar to tests and intervals for the mean of
   a single population.

6. The sample correlation coefficient is an estimator of the population correlation
   coefficient and measures the linear relationship between two variables, that
   is, how well their scatterplot is approximated by a straight line.


Key Terms 


estimator (105)
estimate (105)
bias (106)
consistency (106)
efficiency (106)
BLUE (Best Linear Unbiased Estimator) (107)
least squares estimator (107)
hypothesis tests (109)
null hypothesis (109)
alternative hypothesis (109)
two-sided alternative hypothesis (109)
p-value (significance probability) (110)
sample variance (112)
sample standard deviation (112)
degrees of freedom (112)
standard error of $\bar{Y}$ (113)
t-statistic (113)
t-ratio (113)
test statistic (113)
type I error (115)
type II error (115)
significance level (115)
critical value (115)
rejection region (115)
acceptance region (115)
size of a test (115)
power of a test (115)
one-sided alternative hypothesis (116)
confidence set (117)
confidence level (117)
confidence interval (117)
coverage probability (118)
test for the difference between two means (119)
causal effect (121)
treatment effect (121)
scatterplot (127)
sample covariance (128)
sample correlation coefficient (sample correlation) (128)


MyLab Economics Can Help You Get a Better Grade

If your exam were tomorrow, would you be ready? For each chapter, MyLab Economics
Practice Tests and Study Plan help you prepare for your exams. You can also find the
Exercises and all Review the Concepts Questions available now in MyLab Economics.
To see how it works, turn to the MyLab Economics spread on the inside front cover of
this text and then go to www.pearson.com/mylab/economics.

For additional Empirical Exercises and Data Sets, log on to the Companion Website at
www.pearsonglobaleditions.com.


Review the Concepts 


3.1 Explain the difference between an unbiased estimator and a consistent estimator.

3.2 What is meant by the efficiency of an estimator? Which estimator is known as
    BLUE?

3.3 A population distribution has a mean of 15 and a variance of 10. Determine
    the mean and variance of $\bar{Y}$ from an i.i.d. sample from this population for
    (a) $n = 5$; (b) $n = 500$; and (c) $n = 5000$. Relate your answers to the law of
    large numbers.

3.4 What is the difference between standard error and standard deviation? How
    is the standard error of the sample mean calculated?

3.5 What is the difference between a null hypothesis and an alternative hypothesis?
    Among size, significance level, and power? Between a one-sided alternative
    hypothesis and a two-sided alternative hypothesis?

3.6 Why does a confidence interval contain more information than the result of
    a single hypothesis test?

3.7 What is a scatterplot? What statistical features of a dataset can be represented
    using a scatterplot diagram?

3.8 Sketch a hypothetical scatterplot for a sample of size 10 for two random variables
    with a population correlation of (a) 1.0; (b) −1.0; (c) 0.9; (d) −0.5; and (e) 0.0.
Exercises

3.1 In a population, $\mu_Y = 75$ and $\sigma_Y^2 = 45$. Use the central limit theorem to
    answer the following questions:

    a. In a random sample of size $n = 50$, find $\Pr(\bar{Y} < 73)$.
    b. In a random sample of size $n = 90$, find $\Pr(76 < \bar{Y} < 77)$.
    c. In a random sample of size $n = 120$, find $\Pr(\bar{Y} > 69)$.

3.2 Let $Y$ be a Bernoulli random variable with success probability $\Pr(Y = 1) = p$,
    and let $Y_1, \ldots, Y_n$ be i.i.d. draws from this distribution. Let $\hat{p}$ be the fraction
    of successes (1s) in this sample.

    a. Show that $\hat{p} = \bar{Y}$.
    b. Show that $\hat{p}$ is an unbiased estimator of $p$.
    c. Show that $\text{var}(\hat{p}) = p(1 - p)/n$.

3.3 In a poll of 500 likely voters, 270 responded that they would vote for the candidate
    from the democratic party, while 230 responded that they would vote for the
    candidate from the republican party. Let $p$ denote the fraction of all likely voters
    who preferred the democratic candidate at the time of the poll, and let $\hat{p}$ be the
    fraction of survey respondents who preferred the democratic candidate.

    a. Use the poll results to estimate $p$.
    b. Use the estimator of the variance of $\hat{p}$, $\hat{p}(1 - \hat{p})/n$, to calculate the
       standard error of your estimator.
    c. What is the p-value for the test of $H_0: p = 0.5$ vs. $H_1: p \neq 0.5$?
    d. What is the p-value for the test of $H_0: p = 0.5$ vs. $H_1: p > 0.5$?
    e. Why do the results from (c) and (d) differ?
    f. Did the poll contain statistically significant evidence that the democratic
       candidate was ahead of the republican candidate at the time of the poll?
       Explain.

3.4 Using the data in Exercise 3.3:

    a. Construct a 95% confidence interval for $p$.
    b. Construct a 99% confidence interval for $p$.
    c. Why is the interval in (b) wider than the interval in (a)?
    d. Without doing any additional calculations, test the hypothesis
       $H_0: p = 0.50$ vs. $H_1: p \neq 0.50$ at the 5% significance level.

3.5 A survey of 1000 registered voters is conducted, and the voters are asked to
    choose between candidate A and candidate B. Let $p$ denote the fraction of
    voters in the population who prefer candidate A, and let $\hat{p}$ denote the fraction
    of voters in the sample who prefer candidate A.

    a. You are interested in the competing hypotheses $H_0: p = 0.4$ vs.
       $H_1: p \neq 0.4$. Suppose that you decide to reject $H_0$ if $|\hat{p} - 0.4| > 0.01$.
       i. What is the size of this test?
       ii. Compute the power of this test if $p = 0.45$.

    b. In the survey, $\hat{p} = 0.44$.
       i. Test $H_0: p = 0.4$ vs. $H_1: p \neq 0.4$ using a 10% significance level.
       ii. Test $H_0: p = 0.4$ vs. $H_1: p < 0.4$ using a 10% significance level.
       iii. Construct a 90% confidence interval for $p$.
       iv. Construct a 99% confidence interval for $p$.
       v. Construct a 60% confidence interval for $p$.

    c. Suppose that the survey is carried out 30 times, using independently
       selected voters in each survey. For each of these 30 surveys, a 90% confidence
       interval for $p$ is constructed.
       i. What is the probability that the true value of $p$ is contained in all 30
          of these confidence intervals?
       ii. How many of these confidence intervals do you expect to contain the
           true value of $p$?

    d. In survey jargon, the “margin of error” is $1.96 \times SE(\hat{p})$; that is, it is half
       the length of the 95% confidence interval. Suppose you want to design
       a survey that has a margin of error of at most 0.5%. That is, you want
       $\Pr(|\hat{p} - p| > 0.005) \le 0.05$. How large should $n$ be if the survey uses
       simple random sampling?

3.6 Let $Y_1, \ldots, Y_n$ be i.i.d. draws from a distribution with mean $\mu$. A test of
    $H_0: \mu = 10$ vs. $H_1: \mu \neq 10$ using the usual t-statistic yields a p-value of 0.07.

    a. Does the 90% confidence interval contain $\mu = 10$? Explain.
    b. Can you determine if $\mu = 8$ is contained in the 95% confidence
       interval? Explain.

3.7 In a given population, 50% of the likely voters are women. A survey using
    a simple random sample of 1000 landline telephone numbers finds 55%
    women. Is there evidence that the survey is biased? Explain.

3.8 A new version of the SAT is given to 1500 randomly selected high school
    seniors. The sample mean test score is 1230, and the sample standard deviation
    is 145. Construct a 95% confidence interval for the population mean test score
    for high school seniors.

3.9 Suppose that a plant manufactures integrated circuits with a mean life of
    1000 hours and a standard deviation of 100 hours. An inventor claims to have
    developed an improved process that produces integrated circuits with a longer
    mean life and the same standard deviation. The plant manager randomly
    selects 50 integrated circuits produced by the process. She says that she will
    believe the inventor’s claim if the sample mean life of the integrated circuits


    is greater than 1100 hours; otherwise, she will conclude that the new process
    is no better than the old process. Let $\mu$ denote the mean of the new process.
    Consider the null and alternative hypotheses $H_0: \mu = 1000$ vs. $H_1: \mu > 1000$.

    a. What is the size of the plant manager’s testing procedure?
    b. Suppose the new process is in fact better and has a mean integrated
       circuit life of 1150 hours. What is the power of the plant manager’s testing
       procedure?
    c. What testing procedure should the plant manager use if she wants the
       size of her test to be 1%?

3.10 Suppose a new standardized test is given to 150 randomly selected third-grade
     students in Amsterdam. The sample average score $\bar{Y}$ on the test is 42 points,
     and the sample standard deviation, $s_Y$, is 6 points.

    a. The authors plan to administer the test to all third-grade students in
       Amsterdam. Construct a 99% confidence interval for the mean score of
       all third graders in Amsterdam.
    b. Suppose the same test is given to 300 randomly selected third graders
       from Rotterdam, producing a sample average of 48 points and sample
       standard deviation of 10 points. Construct a 95% confidence interval for
       the difference in mean scores between Rotterdam and Amsterdam.
    c. Can you conclude with a high degree of confidence that the population
       means for Rotterdam and Amsterdam students are different? (What is
       the standard error of the difference in the two sample means? What is the
       p-value of the test of no difference in means versus some difference?)

3.11 Consider the estimator $\tilde{Y}$, defined in Equation (3.1). Show that (a) $E(\tilde{Y}) = \mu_Y$
     and (b) $\text{var}(\tilde{Y}) = 1.25\sigma_Y^2/n$.

3.12 To investigate possible gender discrimination in a British firm, a sample of 120
     men and 150 women with similar job descriptions are selected at random. A
     summary of the resulting monthly salaries follows:

         Average Salary ($\bar{Y}$)   Standard Deviation ($s_Y$)   n
Men      £8200                        £450                         120
Women    £7900                        £520                         150

    a. What do these data suggest about wage differences in the firm? Do
       they represent statistically significant evidence that average wages of
       men and women are different? (To answer this question, first, state the
       null and alternative hypotheses; second, compute the relevant t-statistic;
       third, compute the p-value associated with the t-statistic; and, finally, use
       the p-value to answer the question.)
    b. Do these data suggest that the firm is guilty of gender discrimination in
       its compensation policies? Explain.

3.13 Data on fifth-grade test scores (reading and mathematics) for 400 school districts 
in Brussels yield average score $\bar{Y} = 712.1$ and standard deviation $s_Y = 23.2$. 


a. Construct a 90% confidence interval for the mean test score in the 
population. 

b. When the districts were divided into districts with small classes (< 20 
students per teacher) and large classes (≥ 20 students per teacher), the 
following results were found: 


Class Size    Average Score ($\bar{Y}$)    Standard Deviation ($s_Y$) 


Is there statistically significant evidence that the districts with smaller 
classes have higher average test scores? Explain. 


3.14 Values of height in inches (X) and weight in pounds (Y) are recorded from 
a sample of 200 male college students. The resulting summary statistics are 
$\bar{X} = 71.2$ in., $\bar{Y} = 164$ lb, $s_X = 1.9$ in., $s_Y = 16.4$ lb, $s_{XY} = 22.54$ in. × lb, 
and $r_{XY} = 0.8$. Convert these statistics to the metric system (meters and 
kilograms). 


3.15 $Y_a$ and $Y_b$ are Bernoulli random variables from two different populations, 
denoted a and b. Suppose $E(Y_a) = p_a$ and $E(Y_b) = p_b$. A random sample of 
size $n_a$ is chosen from population a, with a sample average denoted $\hat{p}_a$, and 
a random sample of size $n_b$ is chosen from population b, with a sample average 
denoted $\hat{p}_b$. Suppose the sample from population a is independent of the 
sample from population b. 


a. Show that $E(\hat{p}_a) = p_a$ and $\mathrm{var}(\hat{p}_a) = p_a(1 - p_a)/n_a$. Show that 
$E(\hat{p}_b) = p_b$ and $\mathrm{var}(\hat{p}_b) = p_b(1 - p_b)/n_b$. 

b. Show that $\mathrm{var}(\hat{p}_a - \hat{p}_b) = \dfrac{p_a(1 - p_a)}{n_a} + \dfrac{p_b(1 - p_b)}{n_b}$. 
(Hint: Remember that the samples are independent.) 


c. Suppose $n_a$ and $n_b$ are large. Show that a 95% confidence interval for 
$p_a - p_b$ is given by 

$$(\hat{p}_a - \hat{p}_b) \pm 1.96\sqrt{\frac{\hat{p}_a(1 - \hat{p}_a)}{n_a} + \frac{\hat{p}_b(1 - \hat{p}_b)}{n_b}}.$$

How would you construct a 90% confidence interval for $p_a - p_b$? 


3.16 Assume that grades on a standardized test are known to have a mean of 500 for 
students in Europe. The test is administered to 600 randomly selected students 
in Ukraine; in this sample, the mean is 508, and the standard deviation (s) is 75. 


a. Construct a 95% confidence interval for the average test score for 
Ukrainian students. 

b. Is there statistically significant evidence that Ukrainian students perform 
differently than other students in Europe? 


c. Another 500 students are selected at random from Ukraine. They are 
given a 3-hour preparation course before the test is administered. Their 
average test score is 514, with a standard deviation of 65. 


i. Construct a 95% confidence interval for the change in average test 
score associated with the prep course. 


ii. Is there statistically significant evidence that the prep course 
helped? 
d. The original 600 students are given the prep course and then are asked 
to take the test a second time. The average change in their test scores is 7 
points, and the standard deviation of the change is 40 points. 


i. Construct a 95% confidence interval for the change in average test scores. 


ii. Is there statistically significant evidence that students will perform 
better on their second attempt, after taking the prep course? 


iii. Students may have performed better in their second attempt because 
of the prep course or because they gained test-taking experience in 
their first attempt. Describe an experiment that would quantify these 
two effects. 


3.17 Read the box “Social Class or Education? Childhood Circumstances and 
Adult Earnings Revisited” in Section 3.5. 


a. Construct a 95% confidence interval for the difference in the household 
earnings of people whose father's NS-SEC classification was higher 
between those with no educational qualifications and those with an 
undergraduate degree or more. 

b. Construct a 95% confidence interval for the difference in the household 
earnings of people whose father's NS-SEC classification was routine 
between those with no educational qualifications and those with an 
undergraduate degree or more. 


c. Construct a 95% confidence interval for the difference between your 
answers calculated in parts a and b. 


3.18 This exercise shows that the sample variance is an unbiased estimator of the 
population variance when $Y_1, \ldots, Y_n$ are i.i.d. with mean $\mu_Y$ and variance $\sigma_Y^2$. 


a. Use Equation (2.32) to show that 
$E[(Y_i - \bar{Y})^2] = \mathrm{var}(Y_i) - 2\,\mathrm{cov}(Y_i, \bar{Y}) + \mathrm{var}(\bar{Y})$. 

b. Use Equation (2.34) to show that $\mathrm{cov}(\bar{Y}, Y_i) = \sigma_Y^2/n$. 


c. Use the results in (a) and (b) to show that $E(s_Y^2) = \sigma_Y^2$. 

3.19 
a. $\bar{Y}$ is an unbiased estimator of $\mu_Y$. Is $\bar{Y}^2$ an unbiased estimator of $\mu_Y^2$? 


b. $\bar{Y}$ is a consistent estimator of $\mu_Y$. Is $\bar{Y}^2$ a consistent estimator of $\mu_Y^2$? 


3.20 Suppose $(X_i, Y_i)$ are i.i.d. with finite fourth moments. Prove that the sample 
covariance is a consistent estimator of the population covariance; that is, 
$s_{XY} \xrightarrow{p} \sigma_{XY}$, where $s_{XY}$ is defined in Equation (3.24). (Hint: Use the strategy 
of Appendix 3.3.) 


3.21 Show that the pooled standard error [$SE_{pooled}(\bar{Y}_m - \bar{Y}_w)$] given following 
Equation (3.23) equals the usual standard error for the difference in means 
in Equation (3.19) when the two group sizes are the same ($n_m = n_w$). 


3.22 Suppose $Y_i \sim$ i.i.d. $N(\mu_Y, \sigma_Y^2)$ for $i = 1, \ldots, n$. With $\sigma_Y^2$ known, the t-statistic 
for testing $H_0$: $\mu_Y = 0$ vs. $H_1$: $\mu_Y > 0$ is $t = (\bar{Y} - 0)/SE(\bar{Y})$, where 
$SE(\bar{Y}) = \sigma_Y/\sqrt{n}$. Suppose $\sigma_Y = 10$ and $n = 100$, so that $SE(\bar{Y}) = 1$. Using 
a test with a size of 5%, the null hypothesis is rejected if $t > 1.64$. 


a. Suppose $\mu_Y = 0$, so the null hypothesis is true. What is the probability 
that the null hypothesis is rejected? 


b. Suppose $\mu_Y = 2$, so the alternative hypothesis is true. What is the 
probability that the null hypothesis is rejected? 


c. Suppose that in 90% of cases the data are drawn from a population 
where the null is true ($\mu_Y = 0$) and in 10% of cases the data come from 
a population where the alternative is true and $\mu_Y = 2$. Your data came 
from either the first or the second population, but you don’t know which. 


i. You compute the t-statistic. What is the probability that $t > 1.64$, that 
is, that you reject the null hypothesis? 


ii. Suppose you reject the null hypothesis; that is, $t > 1.64$. What is 
the probability that the sample data were drawn from the $\mu_Y = 0$ 
population? 

d. It is hard to discover a new effective drug. Suppose 90% of new drugs 
are ineffective and only 10% are effective. Let Y denote the drop in the 
level of a specific blood toxin for a patient taking a new drug. If the drug 
is ineffective, $\mu_Y = 0$ and $\sigma_Y = 10$; if the drug is effective, $\mu_Y = 2$ and 
$\sigma_Y = 10$. 

i. A new drug is tested on a random sample of n = 100 patients, data 
are collected, and the resulting t-statistic is found to be greater than 
1.64. What is the probability that the drug is ineffective (i.e., what is 
the false positive rate for the test using $t > 1.64$)? 


ii. Suppose the one-sided test uses instead the 0.5% significance level. 
What is the probability that the drug is ineffective (i.e., what is the 
false positive rate)? 

Empirical Exercises 


E3.1 On the text website, http://www.pearsonglobaleditions.com, you will find the 
data file CPS96_15, which contains an extended version of the data set used in 
Table 3.1 of the text for the years 1996 and 2015. It contains data on full-time 
workers, ages 25-34, with a high school diploma or a B.A./B.S. as their highest 
degree. A detailed description is given in CPS96_15_Description, available on 
the website. Use these data to complete the following. 


a. i. Compute the sample mean for average hourly earnings (AHE) in 
1996 and 2015. 


ii. Compute the sample standard deviation for AHE in 1996 and 2015. 


iii. Construct a 95% confidence interval for the population means of AHE 
in 1996 and 2015. 


iv. Construct a 95% confidence interval for the change in the population 
means of AHE between 1996 and 2015. 


b. In 2015, the value of the Consumer Price Index (CPI) was 237.0. In 1996, 
the value of the CPI was 156.9. Repeat (a), but use AHE measured 
in real 2015 dollars ($2015); that is, adjust the 1996 data for the price 
inflation that occurred between 1996 and 2015. 


c. If you were interested in the change in workers’ purchasing power from 
1996 to 2015, would you use the results from (a) or (b)? Explain. 


d. Using the data for 2015: 


i. Construct a 95% confidence interval for the mean of AHE for high 
school graduates. 
ii. Construct a 95% confidence interval for the mean of AHE for 
workers with a college degree. 
iii. Construct a 95% confidence interval for the difference between the 
two means. 


e. Repeat (d) using the 1996 data expressed in $2015. 


f. Using appropriate estimates, confidence intervals, and test statistics, 
answer the following questions: 


i. Did real (inflation-adjusted) wages of high school graduates increase 
from 1996 to 2015? 


ii. Did real wages of college graduates increase? 


iii. Did the gap between earnings of college and high school graduates 
increase? Explain. 


g. Table 3.1 presents information on the gender gap for college graduates. 
Prepare a similar table for high school graduates, using the 1996 and 
2015 data. Are there any notable differences between the results for high 
school and college graduates? 

E3.2 A consumer is given the chance to buy a baseball card for $1, but he declines 
the trade. If the consumer is now given the baseball card, will he be willing to 
sell it for $1? Standard consumer theory suggests yes, but behavioral economists 
have found that “ownership” tends to increase the value of goods to 
consumers. That is, the consumer may hold out for some amount more than 
$1 (for example, $1.20) when selling the card, even though he was willing 
to pay only some amount less than $1 (for example, $0.88) when buying it. 
Behavioral economists call this phenomenon the “endowment effect.” John 
List investigated the endowment effect in a randomized experiment involving 
sports memorabilia traders at a sports-card show. Traders were randomly 
given one of two sports collectibles, say good A or good B, that had approximately 
equal market value.³ Those receiving good A were then given the 
option of trading good A for good B with the experimenter; those receiving 
good B were given the option of trading good B for good A with the 
experimenter. Data from the experiment and a detailed description can be 
found on the text website, http://www.pearsonglobaleditions.com, in the files 
Sportscards and Sportscards_Description.⁴ 


a. i. Suppose that, absent any endowment effect, all the subjects prefer good 
A to good B. What fraction of the experiment’s subjects would you 
expect to trade the good that they were given for the other good? (Hint: 
Because of random assignment of the two treatments, approximately 
50% of the subjects received good A, and 50% received good B.) 


ii. Suppose that, absent any endowment effect, 50% of the subjects prefer 
good A to good B, and the other 50% prefer good B to good A. What 
fraction of the subjects would you expect to trade the good they were 
given for the other good? 

iii. Suppose that, absent any endowment effect, X% of the subjects prefer 
good A to good B, and the other (100 - X)% prefer good B to good 
A. Show that you would expect 50% of the subjects to trade the good 
they were given for the other good. 


b. Using the sports-card data, what fraction of the subjects traded the good they 
were given? Is the fraction significantly different from 50%? Is there evi- 
dence of an endowment effect? (Hint: Review Exercises 3.2 and 3.3.) 


c. Some have argued that the endowment effect may be present but that it 
is likely to disappear as traders gain more trading experience. Half of the 
experimental subjects were dealers, and the other half were nondealers. 
Dealers have more experience than nondealers. Repeat (b) for dealers 
and nondealers. Is there a significant difference in their behavior? 
Is the evidence consistent with the hypothesis that the endowment effect 
disappears as traders gain more experience? (Hint: Review Exercise 3.15.) 


³ Good A was a ticket stub from the game in which Cal Ripken, Jr., set the record for consecutive games 
played, and good B was a souvenir from the game in which Nolan Ryan won his 300th game. 

⁴ These data were provided by Professor John List of the University of Chicago and were used in his paper “Does 
Market Experience Eliminate Market Anomalies,” Quarterly Journal of Economics, 2003, 118(1): 41-71. 


The U.S. Current Population Survey 


Each month the U.S. Census Bureau and the U.S. Bureau of Labor Statistics conduct the Cur- 
rent Population Survey (CPS), which provides data on labor force characteristics of the popu- 
lation, including the levels of employment, unemployment, and earnings. Approximately 
54,000 U.S. households are surveyed each month. The sample is chosen by randomly selecting 
addresses from a database of addresses from the most recent decennial census augmented with 
data on new housing units constructed after the last census. The exact random sampling 
scheme is rather complicated (first, small geographical areas are randomly selected; then hous- 
ing units within these areas are randomly selected); details can be found in the Handbook of 
Labor Statistics and on the Bureau of Labor Statistics website (www.bls.gov). 

The survey conducted each March is more detailed than those in other months and asks 
questions about earnings during the previous year. The statistics in Tables 2.4 and 3.1 were com- 
puted using the March surveys. The CPS earnings data are for full-time workers, defined to be 
persons employed more than 35 hours per week for at least 48 weeks in the previous year. 

More details on the data can be found in the replication materials for this chapter, 
available at http://www.pearsonglobaleditions.com. 


Two Proofs That $\bar{Y}$ Is the Least Squares 
Estimator of $\mu_Y$ 


This appendix provides two proofs, one using calculus and one not, that $\bar{Y}$ minimizes the sum 
of squared prediction mistakes in Equation (3.2); that is, that $\bar{Y}$ is the least squares estimator 
of E(Y). 


Calculus Proof 


To minimize the sum of squared prediction mistakes, take its derivative and set it to 0: 


$$\frac{d}{dm}\sum_{i=1}^{n}(Y_i - m)^2 = -2\sum_{i=1}^{n}(Y_i - m) = -2\sum_{i=1}^{n}Y_i + 2nm = 0. \qquad (3.27)$$


Solving the final equation for m shows that $\sum_{i=1}^{n}(Y_i - m)^2$ is minimized when $m = \bar{Y}$. 

Noncalculus Proof 


The strategy is to show that the difference between the least squares estimator and $\bar{Y}$ must 
be 0, from which it follows that $\bar{Y}$ is the least squares estimator. Let $d = \bar{Y} - m$, so 
that $m = \bar{Y} - d$. Then $(Y_i - m)^2 = (Y_i - [\bar{Y} - d])^2 = ([Y_i - \bar{Y}] + d)^2 = (Y_i - \bar{Y})^2 + 2d(Y_i - \bar{Y}) + d^2$. 
Thus the sum of squared prediction mistakes [Equation (3.2)] is 


$$\sum_{i=1}^{n}(Y_i - m)^2 = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 + 2d\sum_{i=1}^{n}(Y_i - \bar{Y}) + nd^2 = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 + nd^2, \qquad (3.28)$$


where the second equality uses the fact that $\sum_{i=1}^{n}(Y_i - \bar{Y}) = 0$. Because both terms in the final 
line of Equation (3.28) are nonnegative and because the first term does not depend on 
d, $\sum_{i=1}^{n}(Y_i - m)^2$ is minimized by choosing d to make the second term, $nd^2$, as small as possible. 
This is done by setting d = 0 (that is, by setting $m = \bar{Y}$), so that $\bar{Y}$ is the least squares 
estimator of E(Y). 
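
The two proofs can also be checked numerically. The following Python sketch uses a small made-up sample (the data values and the helper name `sum_sq_mistakes` are illustrative assumptions, not part of the text) to confirm the decomposition in Equation (3.28) and to compare the sample average against a grid of other candidate values of m.

```python
# Numerical check of the least squares property of the sample average,
# using a hypothetical sample (these numbers are not from the text).
ys = [2.0, 4.0, 4.5, 7.0, 10.0]
n = len(ys)
y_bar = sum(ys) / n                       # sample average


def sum_sq_mistakes(m):
    """Sum of squared prediction mistakes, sum_i (Y_i - m)^2."""
    return sum((y - m) ** 2 for y in ys)


# Equation (3.28): sum (Y_i - m)^2 = sum (Y_i - Y_bar)^2 + n * d^2, with d = Y_bar - m.
m = 3.0
d = y_bar - m
lhs = sum_sq_mistakes(m)
rhs = sum_sq_mistakes(y_bar) + n * d ** 2

# Compare the sample average with a grid of other candidate values of m.
candidates = [y_bar + k / 10 for k in range(-30, 31)]
best = min(candidates, key=sum_sq_mistakes)

print(f"Y_bar = {y_bar:.2f}, best m on the grid = {best:.2f}")
print(f"(3.28) left side = {lhs:.4f}, right side = {rhs:.4f}")
```

A grid search of this kind only illustrates the result for one sample; the proofs above are what establish it in general.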


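A Proof That the Sample Variance 
Is Consistent 


This appendix uses the law of large numbers to prove that the sample variance, $s_Y^2$, is a consistent 
estimator of the population variance, $\sigma_Y^2$, as stated in Equation (3.9), when $Y_1, \ldots, Y_n$ are 
i.i.d. and $E(Y_i^4) < \infty$. 


First, consider a version of the sample variance that uses n instead of n - 1 as a divisor: 


$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 &= \frac{1}{n}\sum_{i=1}^{n}Y_i^2 - 2\bar{Y}\,\frac{1}{n}\sum_{i=1}^{n}Y_i + \bar{Y}^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}Y_i^2 - \bar{Y}^2 \\
&\xrightarrow{p} (\sigma_Y^2 + \mu_Y^2) - \mu_Y^2 = \sigma_Y^2, \qquad (3.29)
\end{aligned}$$


where the first equality uses $(Y_i - \bar{Y})^2 = Y_i^2 - 2\bar{Y}Y_i + \bar{Y}^2$ and the second uses $\frac{1}{n}\sum_{i=1}^{n}Y_i = \bar{Y}$. 

The convergence in the third line follows from (i) applying the law of large numbers to 
$\frac{1}{n}\sum_{i=1}^{n}Y_i^2 \xrightarrow{p} E(Y_i^2)$ (which follows because the $Y_i^2$ are i.i.d. and have finite variance because 
$E(Y_i^4)$ is finite), (ii) recognizing that $E(Y_i^2) = \sigma_Y^2 + \mu_Y^2$ (Key Concept 2.3), and (iii) noting 
$\bar{Y} \xrightarrow{p} \mu_Y$, so that $\bar{Y}^2 \xrightarrow{p} \mu_Y^2$. Finally, $s_Y^2 = \left(\frac{n}{n-1}\right)\left(\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2\right) \xrightarrow{p} \sigma_Y^2$ follows 
from Equation (3.29) and $\frac{n}{n-1} \rightarrow 1$. 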
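
Consistency can be illustrated, though not proved, by simulation. The sketch below draws i.i.d. normal samples of increasing size and computes the sample variance with divisor n - 1. The population values (mean 5, standard deviation 2), the seed, and the sample sizes are arbitrary assumptions chosen for illustration.

```python
import random

random.seed(0)                 # fixed seed so the run is reproducible

MU, SIGMA = 5.0, 2.0           # hypothetical population mean and standard deviation
POP_VAR = SIGMA ** 2           # population variance, sigma_Y^2 = 4


def sample_variance(draws):
    """Sample variance s_Y^2 with divisor n - 1."""
    n = len(draws)
    y_bar = sum(draws) / n
    return sum((y - y_bar) ** 2 for y in draws) / (n - 1)


results = {}
for n in (10, 100, 10_000, 100_000):
    draws = [random.gauss(MU, SIGMA) for _ in range(n)]   # i.i.d. N(MU, SIGMA^2)
    results[n] = sample_variance(draws)
    print(f"n = {n:>7}: s_Y^2 = {results[n]:.3f}  (population variance = {POP_VAR})")
```

In a typical run the estimates for small n can be far from 4, while the estimate for the largest n is close to it, which is the pattern that consistency predicts.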

Chapter 4 Linear Regression with One Regressor 


The superintendent of an elementary school district must decide whether to hire 
additional teachers, and she wants your advice. Hiring the teachers will reduce the 
number of students per teacher (the student-teacher ratio) by two but will increase 
the district's expenses. So she asks you: If she cuts class sizes by two, what will the 
effect be on student performance, as measured by scores on standardized tests? 

Now suppose a father tells you that his family wants to move to a town with a 
good school system. He is interested in a specific school district: Test scores for this 
district are not publicly available, but the father knows its class size, based on the 
district's student-teacher ratio. So he asks you: if he tells you the district's class size, 
could you predict that district's standardized test scores? 

These two questions are clearly related: They both pertain to the relation between 
class size and test scores. Yet they are different. To answer the superintendent's ques- 
tion, you need an estimate of the causal effect of a change in one variable (the student- 
teacher ratio, X) on another (test scores, Y). To answer the father’s question, you need 
to know how X relates to Y, on average, across school districts so you can use this 
relation to predict Y given X in a specific district. 

These two questions are examples of two different types of questions that arise in 
econometrics. The first type of questions pertains to causal inference: using data to 
estimate the effect on an outcome of interest of an intervention that changes the value 
of another variable. The second type of questions concerns prediction: using the 
observed value of some variable to predict the value of another variable. 

This chapter introduces the linear regression model relating one variable, X, 
to another, Y. This model postulates a linear relationship between X and Y. Just as 
the mean of Y is an unknown characteristic of the population distribution of Y, the 
intercept and slope of the line relating X and Y are unknown characteristics of the 
population joint distribution of X and Y. The econometric problem is to estimate the 
intercept and slope using a sample of data on these two variables. 

Like the differences in means, linear regression is a statistical procedure that can be 
used for causal inference and for prediction. The two uses, however, place different 
requirements on the data. Section 3.5 explained how a difference in mean outcomes 
between a treatment and a control group estimates the causal effect of the treatment 
when the treatment is randomly assigned in an experiment. When X is continuous, com- 
puting differences-in-means no longer works because there are many values X can take 
on, not just two. If, however, we make the additional assumption that the relation between 
X and Y is linear, then if X is randomly assigned, we can use linear regression to estimate 
the causal effect on Y of an intervention that changes X. Even if X is not randomly assigned, 
however, linear regression gives us a way to predict the value of Y given X by modeling the 
conditional mean of Y given X as a linear function of X. As long as the observation for 
which Y is to be predicted is drawn from the same population as the data used to estimate 
the linear regression, the regression line provides a way to predict Y given X. 

Sections 4.1-4.3 lay out the linear regression model and the least squares estima- 
tors of its slope and intercept. In Section 4.4, we turn to requirements on the data for 
estimation of a causal effect. In essence, the key requirement is that either X is set at 
random in an experiment or X is as-if randomly set. 

Our focus on causal inference continues through Chapter 13. We return to the 
prediction problem in Chapter 14. 


The Linear Regression Model 


Return to the father’s question: If he tells you the district’s class size, could you 
predict that district’s standardized test scores? In Chapter 2, we used the notation 
E(Y|X = x) to denote the mean of Y given that X takes on the value x—that is, 
the conditional expectation of Y given X = x. The easiest starting point for mod- 
eling a function of X, when X can take on multiple values, is to suppose that it is 
linear. In the case of test scores and class size, this linear function can be 
written 


$$E(TestScore \mid ClassSize) = \beta_0 + \beta_{ClassSize} \times ClassSize, \qquad (4.1)$$


where $\beta$ is the Greek letter beta, $\beta_0$ is the intercept, and $\beta_{ClassSize}$ is the slope. 

If you were lucky enough to know $\beta_0$ and $\beta_{ClassSize}$, you could use Equation (4.1) 
to answer the father’s question. For example, suppose he was looking at a district 
with a class size of 20 and that $\beta_0 = 720$ and $\beta_{ClassSize} = -0.6$. Then you could answer 
his question: Given that the class size is 20, you would predict test scores to be 
$720 - 0.6 \times 20 = 708$. 
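
The arithmetic of this prediction can be written as a short Python sketch. The coefficient values are the hypothetical ones from the example above; the function name `predicted_test_score` is an illustrative choice.

```python
# Hypothetical coefficient values from the example in the text.
BETA_0 = 720.0            # intercept
BETA_CLASS_SIZE = -0.6    # slope


def predicted_test_score(class_size):
    """Conditional mean of TestScore given ClassSize under Equation (4.1)."""
    return BETA_0 + BETA_CLASS_SIZE * class_size


print(predicted_test_score(20))   # 720 - 0.6 * 20 = 708
```

The function returns the average test score across districts with the given class size under these assumed coefficients, not the score of any particular district.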

Equation (4.1) tells you what the test score will be, on average, for districts with 
class sizes of that value; it does not tell you what specifically the test score will be in 
any one district. Districts with the same class sizes can nevertheless differ in many 
ways and in general will have different values of test scores. As a result, if we use 
Equation (4.1) to make a prediction for a given district, we know that prediction will 
not be exactly right: The prediction will have an error. Stated mathematically, for any 
given district the imperfect relationship between class size and test score can be 
written 


$$TestScore = \beta_0 + \beta_{ClassSize} \times ClassSize + error. \qquad (4.2)$$


Equation (4.2) expresses the test score for the district in terms of one component, 
$\beta_0 + \beta_{ClassSize} \times ClassSize$, that represents the average relationship between class 
size and scores in the population of school districts, and a second component that 
represents the error made using the prediction in Equation (4.1). 

Although this discussion has focused on test scores and class size, the idea 
expressed in Equation (4.2) is much more general, so it is useful to introduce more 
general notation. Suppose you have a sample of n districts. Let $Y_i$ be the average test 
score in the $i$th district, and let $X_i$ be the average class size in the $i$th district, so that 
Equation (4.1) becomes $E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i$. Let $u_i$ denote the error made by 
predicting $Y_i$ using its conditional mean. Then Equation (4.2) can be written more 
generally as 


$$Y_i = \beta_0 + \beta_1 X_i + u_i \qquad (4.3)$$


for each district (that is, $i = 1, \ldots, n$), where $\beta_0$ is the intercept of this line and $\beta_1$ is 
the slope. The general notation $\beta_1$ is used for the slope in Equation (4.3) instead of 
$\beta_{ClassSize}$ because this equation is written in terms of a general variable X. 

Equation (4.3) is the linear regression model with a single regressor, in which Y 
is the dependent variable and X is the independent variable or the regressor. 

The first part of Equation (4.3), $\beta_0 + \beta_1 X_i$, is the population regression line 
or the population regression function. This is the relationship that holds between 
Y and X, on average, over the population. Thus, given the value of X, according 
to this population regression line you would predict the value of the dependent 
variable, Y, to be its conditional mean given X. That conditional mean is given by 
Equation (4.1) which, in the more general notation of Equation (4.3), is 
$E(Y \mid X) = \beta_0 + \beta_1 X$. 

The intercept $\beta_0$ and the slope $\beta_1$ are the coefficients of the population regression 
line, also known as the parameters of the population regression line. The slope 
$\beta_1$ is the difference in Y associated with a unit difference in X. The intercept is the 
value of the population regression line when X = 0; it is the point at which the 
population regression line intersects the Y axis. In some econometric applica- 
tions, the intercept has a meaningful economic interpretation. In other applica- 
tions, the intercept has no real-world meaning; for example, when X is the class 
size, strictly speaking the intercept is the expected value of test scores when there 
are no students in the class! When the real-world meaning of the intercept is 
nonsensical, it is best to think of it simply as the coefficient that determines the 
level of the regression line. 

The term $u_i$ in Equation (4.3) is the error term. In the context of the prediction 
problem, $u_i$ is the difference between $Y_i$ and its predicted value using the population 
regression line. 
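
To make the roles of the population regression line and the error term concrete, the following Python sketch generates data from Equation (4.3) and then recovers each error as the gap between $Y_i$ and the line. Everything numerical here is an assumption for illustration: the coefficients reuse the hypothetical values 720 and -0.6, and the seven class sizes, the error standard deviation, and the seed are made up.

```python
import random

random.seed(1)

BETA_0, BETA_1 = 720.0, -0.6      # hypothetical population intercept and slope
class_sizes = [17.0, 18.5, 19.0, 20.0, 21.0, 22.5, 24.0]   # seven made-up districts

# Generate Y_i = beta_0 + beta_1 * X_i + u_i with made-up errors u_i.
errors = [random.gauss(0.0, 5.0) for _ in class_sizes]
scores = [BETA_0 + BETA_1 * x + u for x, u in zip(class_sizes, errors)]

for i, (x, y) in enumerate(zip(class_sizes, scores), start=1):
    line_value = BETA_0 + BETA_1 * x          # population regression line at X_i
    u = y - line_value                        # error term u_i
    position = "above" if u > 0 else "below"
    print(f"district {i}: X = {x:4.1f}, Y = {y:6.1f}, u = {u:+5.1f} ({position} the line)")
```

In practice the coefficients are unknown, so the errors cannot be computed this way; the sketch is possible only because the population line was chosen in advance.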

The linear regression model and its terminology are summarized in Key 
Concept 4.1. 

Key Concept 4.1: Terminology for the Linear Regression Model 
with a Single Regressor 


The linear regression model is 


$$Y_i = \beta_0 + \beta_1 X_i + u_i,$$

where 


the subscript i runs over observations, $i = 1, \ldots, n$; 

$Y_i$ is the dependent variable, the regressand, or simply the left-hand variable; 
$X_i$ is the independent variable, the regressor, or simply the right-hand variable; 
$\beta_0 + \beta_1 X$ is the population regression line or the population regression function; 
$\beta_0$ is the intercept of the population regression line; 

$\beta_1$ is the slope of the population regression line; and 


$u_i$ is the error term. 


Figure 4.1 summarizes the linear regression model with a single regressor for 
seven hypothetical observations on test scores (Y) and class size (X). The population 
regression line is the straight line $\beta_0 + \beta_1 X$. The population regression line slopes 
down ($\beta_1 < 0$), which means that districts with lower student-teacher ratios (smaller 
classes) tend to have higher test scores. The intercept $\beta_0$ has a mathematical meaning 
as the value of the Y axis intersected by the population regression line, but, as 
mentioned earlier, it has no real-world meaning in this example. 


The hypothetical observations in Figure 4.1 do not fall exactly on the population 
regression line. For example, the value of Y for district 1, $Y_1$, is above the population 
regression line. This means that test scores in district 1 were better than predicted by 
the population regression line, so the error term for that district, $u_1$, is positive. In 
contrast, $Y_2$ is below the population regression line, so test scores for that district were 
worse than predicted and $u_2 < 0$. 


4.2 Estimating the Coefficients of the Linear 
Regression Model 


In a practical situation such as the application to class size and test scores, the intercept 
$\beta_0$ and the slope $\beta_1$ of the population regression line are unknown. Therefore, we 
must use data to estimate these unknown coefficients. 

This estimation problem is similar to those faced in Chapter 3. For example, suppose 
you want to compare the mean earnings of men and women who recently graduated 
from college. Although the population mean earnings are unknown, we can estimate the 
population means using a random sample of male and female college graduates. Then 
the natural estimator of the unknown population mean earnings for women, for example, 
is the average earnings of the female college graduates in the sample. 

The same idea extends to the linear regression model. We do not know the population 
value of $\beta_{ClassSize}$, the slope of the unknown population regression line relating 
X (class size) and Y (test scores). But just as it was possible to learn about the population 
mean using a sample of data drawn from that population, so is it possible to learn 
about the population slope $\beta_{ClassSize}$ using a sample of data. 

The data we analyze here consist of test scores and class sizes in 1999 in 420 California 
school districts that serve kindergarten through eighth grade. The test score is the 
districtwide average of reading and math scores for fifth graders. Class size can be measured 
in various ways. The measure used here is one of the broadest, which is the number 
of students in the district divided by the number of teachers, that is, the districtwide 
student-teacher ratio. These data are described in more detail in Appendix 4.1. 

Table 4.1 summarizes the distributions of test scores and class sizes for this 
sample. The average student-teacher ratio is 19.6 students per teacher, and the standard 
deviation is 1.9 students per teacher. The 10th percentile of the distribution of 
the student-teacher ratio is 17.3 (that is, only 10% of districts have student-teacher 
ratios below 17.3), while the district at the 90th percentile has a student-teacher ratio 
of 21.9. 


TABLE 4.1 Summary of the Distribution of Student-Teacher Ratios and Fifth-Grade 
Test Scores for 420 K-8 Districts in California in 1999 

                                   Standard     Percentile 
                        Average    Deviation    10%     25%     40%     50% (median)   60%     75%     90% 
Student-teacher ratio   19.6       1.9          17.3    18.6    19.3    19.7           20.1    20.9    21.9 
Test score              654.2      19.1         630.4   640.0   649.1   654.5          659.4   666.7   679.1 


FIGURE 4.2 Scatterplot of Test Score vs. Student-Teacher Ratio (California School District Data) 

A scatterplot of these 420 observations on test scores and student-teacher ratios 
is shown in Figure 4.2. The sample correlation is —0.23, indicating a weak negative 
relationship between the two variables. Although larger classes in this sample tend 
to have lower test scores, there are other determinants of test scores that keep the 
observations from falling perfectly along a straight line. 
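
As an illustration of how a sample correlation like the one reported above is computed, the following Python sketch applies the formula $r_{XY} = s_{XY}/(s_X s_Y)$ to six made-up districts. The data values are hypothetical and are not the California data, so the resulting number will not match the -0.23 quoted in the text.

```python
from math import sqrt

# Made-up student-teacher ratios (X) and test scores (Y) for six districts.
x = [17.0, 18.5, 19.5, 20.0, 21.0, 22.0]
y = [668.0, 650.0, 661.0, 645.0, 657.0, 648.0]


def sample_correlation(xs, ys):
    """Sample correlation r_XY = s_XY / (s_X * s_Y), using divisor n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((a - x_bar) ** 2 for a in xs) / (n - 1))
    s_y = sqrt(sum((b - y_bar) ** 2 for b in ys) / (n - 1))
    return s_xy / (s_x * s_y)


r = sample_correlation(x, y)
print(f"sample correlation = {r:.2f}")
```

A correlation between -1 and 0 means that larger values of X tend to go with smaller values of Y in the sample, without the points lying exactly on a line.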

Despite this low correlation, if one could somehow draw a straight line through 
these data, then the slope of this line would be an estimate of $\beta_{ClassSize}$ based on these 
data. One way to draw the line would be to take out a pencil and a ruler and to “eyeball” 
the best line you could. While this method is easy, it is unscientific, and different 
people would create different estimated lines. 

How, then, should you choose among the many possible lines? By far the most 
common way is to choose the line that produces the “least squares” fit to these 
data—that is, to use the ordinary least squares (OLS) estimator. 


The Ordinary Least Squares Estimator 


The OLS estimator chooses the regression coefficients so that the estimated regres- 
sion line is as close as possible to the observed data, where closeness is measured by 
the sum of the squared mistakes made in predicting Y given X. 

As discussed in Section 3.1, the sample average, $\bar{Y}$, is the least squares estimator 
of the population mean, E(Y); that is, $\bar{Y}$ minimizes the total squared estimation 
mistakes $\sum_{i=1}^{n}(Y_i - m)^2$ among all possible estimators m [see Expression (3.2)]. 

The OLS estimator extends this idea to the linear regression model. Let $b_0$ and 
$b_1$ be some estimators of $\beta_0$ and $\beta_1$. The regression line based on these estimators is 
$b_0 + b_1 X$, so the value of $Y_i$ predicted using this line is $b_0 + b_1 X_i$. Thus the mistake 
made in predicting the $i$th observation is $Y_i - (b_0 + b_1 X_i) = Y_i - b_0 - b_1 X_i$. The 
sum of these squared prediction mistakes over all n observations is 


$$\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2. \qquad (4.4)$$
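
Expression (4.4) translates directly into code. The sketch below evaluates the sum of squared prediction mistakes for two candidate lines on a small made-up data set; the data values, the candidate coefficients, and the function name `sum_squared_mistakes` are all illustrative assumptions.

```python
# Made-up observations on class size (X) and test score (Y).
x = [17.0, 18.5, 19.5, 20.0, 21.0, 22.0]
y = [668.0, 650.0, 661.0, 645.0, 657.0, 648.0]


def sum_squared_mistakes(b0, b1):
    """Expression (4.4): sum over i of (Y_i - b0 - b1 * X_i)^2."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


# Two candidate lines; the one with the smaller sum fits these data more closely.
flat_line = sum_squared_mistakes(655.0, 0.0)
sloped_line = sum_squared_mistakes(710.0, -2.8)
print(f"b0 = 655, b1 =  0.0: {flat_line:.1f}")
print(f"b0 = 710, b1 = -2.8: {sloped_line:.1f}")
```

Setting `b1` to 0 reduces the calculation to the sum of squared mistakes for estimating a mean, with `b0` playing the role of m.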


The sum of the squared mistakes for the linear regression model in Expression (4.4) 
is the extension of the sum of the squared mistakes for the problem of estimating the 
mean in Expression (3.2). In fact, if there is no regressor, then $b_1$ does not enter 
Expression (4.4), and the two problems are identical except for the different notation 
[m in Expression (3.2), $b_0$ in Expression (4.4)]. Just as there is a unique estimator, $\bar{Y}$, 
that minimizes Expression (3.2), so there is a unique pair of estimators of $\beta_0$ and $\beta_1$ 
that minimizes Expression (4.4). 

The estimators of the intercept and slope that minimize the sum of squared mistakes 
in Expression (4.4) are called the ordinary least squares (OLS) estimators of 
$\beta_0$ and $\beta_1$. 

OLS has its own special notation and terminology. The OLS estimator of $\beta_0$ is 
denoted $\hat{\beta}_0$, and the OLS estimator of $\beta_1$ is denoted $\hat{\beta}_1$. The OLS regression line, also 
called the sample regression line or sample regression function, is the straight line 
constructed using the OLS estimators: $\hat{\beta}_0 + \hat{\beta}_1 X$. The predicted value of $Y_i$ given $X_i$, 
based on the OLS regression line, is $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$. The residual for the $i$th 
observation is the difference between $Y_i$ and its predicted value: $\hat{u}_i = Y_i - \hat{Y}_i$. 

The OLS estimators, $\hat{\beta}_0$ and $\hat{\beta}_1$, are sample counterparts of the population 
coefficients, $\beta_0$ and $\beta_1$. Similarly, the OLS regression line, $\hat{\beta}_0 + \hat{\beta}_1 X$, is the sample 
counterpart of the population regression line, $\beta_0 + \beta_1 X$; and the OLS residuals, $\hat{u}_i$, are 
sample counterparts of the population errors, $u_i$. 

You could compute the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ by trying different values of 
$b_0$ and $b_1$ repeatedly until you find those that minimize the total squared mistakes in 
Expression (4.4); they are the least squares estimates. This method would be tedious, 
however. Fortunately, there are formulas, derived by minimizing Expression (4.4) 
using calculus, that streamline the calculation of the OLS estimators. 
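
As a brief sketch of that calculus (the full derivation is in Appendix 4.2), setting the partial derivatives of Expression (4.4) with respect to $b_0$ and $b_1$ equal to zero gives two first-order conditions:

```latex
\begin{align*}
\frac{\partial}{\partial b_0}\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2
  &= -2\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i) = 0, \\
\frac{\partial}{\partial b_1}\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2
  &= -2\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)\,X_i = 0.
\end{align*}
```

Dividing the first condition by $n$ gives $b_0 = \bar{Y} - b_1\bar{X}$. Substituting this into the second condition and rearranging gives $b_1 = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) / \sum_{i=1}^{n}(X_i - \bar{X})^2$, provided the $X_i$ are not all equal.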

The OLS formulas and terminology are collected in Key Concept 4.2. These 
formulas, which are derived in Appendix 4.2, are implemented in virtually all statisti- 
cal and spreadsheet software. 
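
To make the formulas concrete, the following minimal Python sketch computes the OLS intercept and slope as the ratio of the sum of cross-deviations to the sum of squared deviations of X, and then checks that moving either coefficient away from the OLS values increases the sum of squared mistakes in Expression (4.4). The data are simulated and hypothetical (they are not the California data), and the function names `ols_fit` and `sum_squared_mistakes` are illustrative choices, not part of any library.

```python
import random


def ols_fit(x, y):
    """Return (intercept, slope) computed from the textbook OLS formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations over sum of squared X deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    # Intercept: the fitted line passes through the point of sample means.
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_mistakes(b0, b1, x, y):
    """Expression (4.4) evaluated at candidate coefficients (b0, b1)."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: Y = 700 - 2 X + noise (an assumption for illustration).
rng = random.Random(0)
x = [rng.uniform(14, 26) for _ in range(200)]
y = [700 - 2 * xi + rng.gauss(0, 15) for xi in x]

b0_hat, b1_hat = ols_fit(x, y)
best = sum_squared_mistakes(b0_hat, b1_hat, x, y)

# Nudging either coefficient away from the OLS values raises Expression (4.4).
nudged = [
    sum_squared_mistakes(b0_hat + d0, b1_hat + d1, x, y)
    for d0, d1 in [(0.5, 0), (-0.5, 0), (0, 0.05), (0, -0.05)]
]
print(f"intercept = {b0_hat:.2f}, slope = {b1_hat:.3f}")
print("OLS gives the smallest sum of squared mistakes:", all(v > best for v in nudged))
```

The trial-and-error search described in the text would arrive at the same pair of coefficients; the formulas simply get there in one step.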


Key Concept 4.2: The OLS Estimator, Predicted Values, and Residuals 


The OLS estimators of the slope $\beta_1$ and the intercept $\beta_0$ are 

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{s_{XY}}{s_X^2} \tag{4.5}$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}. \tag{4.6}$$

The OLS predicted values $\hat{Y}_i$ and residuals $\hat{u}_i$ are 

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i, \quad i = 1, \ldots, n \tag{4.7}$$

$$\hat{u}_i = Y_i - \hat{Y}_i, \quad i = 1, \ldots, n. \tag{4.8}$$

The estimated intercept ($\hat{\beta}_0$), slope ($\hat{\beta}_1$), and residual ($\hat{u}_i$) are computed from a 
sample of $n$ observations of $X_i$ and $Y_i$, $i = 1, \ldots, n$. These are estimates of the 
unknown true population intercept ($\beta_0$), slope ($\beta_1$), and error term ($u_i$). 


OLS Estimates of the Relationship Between Test Scores 
and the Student-Teacher Ratio 


When OLS is used to estimate a line relating the student-teacher ratio to test 
scores using the 420 observations in Figure 4.2, the estimated slope is −2.28, and 
the estimated intercept is 698.9. Accordingly, the OLS regression line for these 
420 observations is 


$$\widehat{TestScore} = 698.9 - 2.28 \times STR, \tag{4.9}$$


where TestScore is the average test score in the district and STR is the 
student-teacher ratio. The “^” over TestScore in Equation (4.9) indicates that it is the 
predicted value based on the OLS regression line. Figure 4.3 plots this OLS regression 
line superimposed over the scatterplot of the data previously shown in Figure 4.2. 

The slope of −2.28 means that when comparing two districts with class sizes that 
differ by one student per class (that is, STR differs by 1), the district with the larger 
class size has, on average, test scores that are lower by 2.28 points. A difference in the 
student-teacher ratio of two students per class is, on average, associated with a 
difference in test scores of 4.56 points [= −2 × (−2.28)]. The negative slope indicates 
that districts with more students per teacher (larger classes) tend to do worse 
on the test. 

FIGURE 4.3 The Estimated Regression Line for the California Data 

It is now possible to predict the districtwide test score given a value of the 
student-teacher ratio. For example, for a district with 20 students per teacher, the predicted 
test score is 698.9 − 2.28 × 20 = 653.3. Of course, this prediction will not be exactly 
right because of the other factors that determine a district’s performance. But the 
regression line does give a prediction (the OLS prediction) of what test scores 
would be for that district, based on its student-teacher ratio, absent those other 
factors. 
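
The arithmetic behind these predictions can be reproduced in a few lines of Python. This is only a sketch that plugs values into the estimated line in Equation (4.9); the function name `predicted_test_score` is an illustrative choice.

```python
# Coefficients reported in Equation (4.9).
INTERCEPT = 698.9
SLOPE = -2.28


def predicted_test_score(student_teacher_ratio):
    """Predicted district test score from the estimated regression line."""
    return INTERCEPT + SLOPE * student_teacher_ratio


pred_20 = predicted_test_score(20)
# Predicted change in test scores when the ratio falls by two students per teacher.
change_minus_2 = predicted_test_score(18) - predicted_test_score(20)

print(f"Predicted score at STR = 20: {pred_20:.1f}")
print(f"Predicted change for a cut of 2 students per teacher: {change_minus_2:.2f}")
```

The two printed values match the 653.3 and 4.56 points discussed in the text.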

Is the estimated slope large or small? According to Equation (4.9), for two 
districts with student-teacher ratios that differ by 2, the predicted value of test scores 
would differ by 4.56 points. For the California data, this difference of two students 
per class is large: It is roughly the difference between the median and the 10th 
percentile in Table 4.1. The associated difference in predicted test scores, however, is 
small compared to the spread of test scores in the data: 4.56 is slightly less than the 
difference between the median and the 60th percentile of test scores. In other words, 
a difference in class size that is large among these schools is associated with a 
relatively small difference in predicted test scores. 


Why Use the OLS Estimator? 


There are both practical and theoretical reasons to use the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$. 
Because OLS is the dominant method used in practice, it has become the common 
language for regression analysis throughout economics, finance (see “The ‘Beta’ of a 
Stock” box), and the social sciences more generally. Presenting results using OLS (or 
its variants discussed later in this text) means that you are “speaking the same 
language” as other economists and statisticians. The OLS formulas are built into 
virtually all spreadsheet and statistical software packages, making OLS easy to use. 


The “Beta” of a Stock 


A fundamental idea of modern finance is that an 
investor needs a financial incentive to take a 
risk. Said differently, the expected return¹ on a risky 
investment, $R$, must exceed the return on a safe, or 
risk-free, investment, $R_f$. Thus the expected excess 
return, $R - R_f$, on a risky investment, like owning 
stock in a company, should be positive. 

At first, it might seem like the risk of a stock 
should be measured by its variance. Much of that 
risk, however, can be reduced by holding other 
stocks in a “portfolio” —in other words, by diversify- 
ing your financial holdings. This means that the right 
way to measure the risk of a stock is not by its vari- 
ance but rather by its covariance with the market. 

The capital asset pricing model (CAPM) formalizes 
this idea. According to the CAPM, the expected excess 
return on an asset is proportional to the expected 
excess return on a portfolio of all available assets (the 
market portfolio). That is, the CAPM says that 


$$R - R_f = \beta(R_m - R_f), \tag{4.10}$$


where $R_m$ is the expected return on the market 
portfolio and $\beta$ is the coefficient in the population 
regression of $R - R_f$ on $R_m - R_f$. In practice, the 
risk-free return is often taken to be the rate of 
interest on short-term U.S. government debt. According 
to the CAPM, a stock with a $\beta < 1$ has less risk 
than the market portfolio and therefore has a lower 
expected excess return than the market portfolio. In 
contrast, a stock with a $\beta > 1$ is riskier than the 
market portfolio and thus commands a higher expected 
excess return. 

The “beta” of a stock has become a workhorse 
of the investment industry, and you can obtain esti- 
mated betas for hundreds of stocks on investment 
firm websites. Those betas typically are estimated 
by OLS regression of the actual excess return on 
the stock against the actual excess return on a broad 
market index. 
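
As a rough illustration of how such a beta could be estimated, the sketch below simulates monthly returns for a hypothetical stock and a hypothetical market index, forms excess returns, and regresses one on the other by OLS. Every number here is an assumption made for illustration (the value of beta used to generate the data, the constant risk-free return, and the return volatilities); none of it is real market data.

```python
import random


def ols_slope_intercept(x, y):
    """Return (slope, intercept) from the OLS formulas."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return slope, y_bar - slope * x_bar


# Simulated monthly returns in percent (hypothetical, not real market data).
rng = random.Random(1)
TRUE_BETA = 1.3   # value used to generate the data
risk_free = 0.2   # assumed constant monthly risk-free return
market = [risk_free + rng.gauss(0.6, 4.0) for _ in range(240)]
stock = [risk_free + TRUE_BETA * (m - risk_free) + rng.gauss(0, 5.0)
         for m in market]

# Excess returns: subtract the risk-free return from each series.
market_excess = [m - risk_free for m in market]
stock_excess = [s - risk_free for s in stock]

# The OLS slope in the regression of stock excess returns on market
# excess returns is the estimated beta.
beta_hat, alpha_hat = ols_slope_intercept(market_excess, stock_excess)
print(f"estimated beta = {beta_hat:.2f} (value used in the simulation: {TRUE_BETA})")
```

The estimate differs from the value used in the simulation only because of sampling variation in the simulated returns.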

The table below gives estimated betas for seven 
U.S. stocks. Low-risk sellers and producers of 
consumer staples like Wal-Mart and Coca-Cola have 
stocks with low betas; riskier stocks have high betas. 


Company                              Estimated $\beta$ 
Wal-Mart (discount retailer)         0.1 
Coca-Cola (soft drinks)              0.6 
Verizon (telecommunications)         0.7 
Google (information technology)      1.0 
General Electric (industrial)        1.1 
Boeing (aircraft)                    1.3 
Bank of America (bank)               1.7 


Source: finance.yahoo.com. 


¹The return on an investment is the change in its price plus 
any payout (dividend) from the investment as a percentage 
of its initial price. For example, a stock bought on January 1 
for $100, which then paid a $2.50 dividend during the year 
and sold on December 31 for $105, would have a return of 
R = [($105 − $100) + $2.50]/$100 = 7.5%. 


The OLS estimators also have desirable theoretical properties. They are 
analogous to the desirable properties, studied in Section 3.1, of $\bar{Y}$ as an estimator of the 
population mean. Under the assumptions introduced in Section 4.4, the OLS 
estimator is unbiased and consistent. The OLS estimator is also efficient among a 
certain class of unbiased estimators; however, this efficiency result holds under 
some additional special conditions, and further discussion of this result is deferred 
until Section 5.5. 


4.3 Measures of Fit and Prediction Accuracy 


Having estimated a linear regression, you might wonder how well that regression line 
describes the data. Does the regressor account for much or for little of the variation 
in the dependent variable? Are the observations tightly clustered around the regres- 
sion line, or are they spread out? 

The $R^2$ and the standard error of the regression measure how well the OLS 
regression line fits the data. The $R^2$ ranges between 0 and 1 and measures the fraction 
of the variance of $Y_i$ that is explained by $X_i$. The standard error of the regression 
measures how far $Y_i$ typically is from its predicted value. 


The $R^2$ 


The regression $R^2$ is the fraction of the sample variance of $Y$ explained by (or predicted 
by) $X$. The definitions of the predicted value and the residual (see Key Concept 4.2) 
allow us to write the dependent variable $Y_i$ as the sum of the predicted value, $\hat{Y}_i$, plus 
the residual $\hat{u}_i$: 


$$Y_i = \hat{Y}_i + \hat{u}_i. \tag{4.11}$$


In this notation, the $R^2$ is the ratio of the sample variance of $\hat{Y}$ to the sample variance of $Y$. 

Mathematically, the $R^2$ can be written as the ratio of the explained sum of squares 
to the total sum of squares. The explained sum of squares (ESS) is the sum of squared 
deviations of the predicted value, $\hat{Y}_i$, from its average, and the total sum of squares 
(TSS) is the sum of squared deviations of $Y_i$ from its average: 


$$ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \tag{4.12}$$

$$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2. \tag{4.13}$$

Equation (4.12) uses the fact that the sample average OLS predicted value equals $\bar{Y}$ 
(proven in Appendix 4.3). 

The $R^2$ is the ratio of the explained sum of squares to the total sum of squares: 


$$R^2 = \frac{ESS}{TSS}. \tag{4.14}$$


Alternatively, the $R^2$ can be written in terms of the fraction of the variance of $Y_i$ not 
explained by $X_i$. The sum of squared residuals (SSR) is the sum of the squared OLS 
residuals: 


$$SSR = \sum_{i=1}^{n} \hat{u}_i^2. \tag{4.15}$$


It is shown in Appendix 4.3 that $TSS = ESS + SSR$. Thus the $R^2$ also can be expressed 
as 1 minus the ratio of the sum of squared residuals to the total sum of squares: 


$$R^2 = 1 - \frac{SSR}{TSS}. \tag{4.16}$$


Finally, the $R^2$ of the regression of $Y$ on the single regressor $X$ is the square of the 
correlation coefficient between $Y$ and $X$ (Exercise 4.12). 

The $R^2$ ranges between 0 and 1. If $\hat{\beta}_1 = 0$, then $X_i$ explains none of the variation 
of $Y_i$, and the predicted value of $Y_i$ is $\hat{Y}_i = \hat{\beta}_0 = \bar{Y}$ [from Equation (4.6)]. In this case, 
the explained sum of squares is 0 and the sum of squared residuals equals the total 
sum of squares; thus the $R^2$ is 0. In contrast, if $X_i$ explains all of the variation of $Y_i$, 
then $Y_i = \hat{Y}_i$ for all $i$, and every residual is 0 (that is, $\hat{u}_i = 0$), so that $ESS = TSS$ and 
$R^2 = 1$. In general, the $R^2$ does not take on the extreme value of 0 or 1 but falls 
somewhere in between. An $R^2$ near 1 indicates that the regressor is good at predicting 
$Y_i$, while an $R^2$ near 0 indicates that the regressor is not very good at predicting $Y_i$. 
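
The three ways of writing the $R^2$ can be checked numerically. The sketch below, which uses simulated (hypothetical) data and illustrative function names, computes ESS, TSS, and SSR from an OLS fit and compares ESS/TSS, 1 − SSR/TSS, and the squared sample correlation between X and Y.

```python
import math
import random


def ols_fit(x, y):
    """Return (intercept, slope) from the OLS formulas."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope


def sums_of_squares(x, y):
    """Return (ESS, TSS, SSR) for the OLS regression of y on x."""
    b0, b1 = ols_fit(x, y)
    y_bar = sum(y) / len(y)
    y_hat = [b0 + b1 * xi for xi in x]
    ess = sum((yh - y_bar) ** 2 for yh in y_hat)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return ess, tss, ssr


def correlation(x, y):
    """Sample correlation coefficient between x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data with a weak linear relationship.
rng = random.Random(2)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [1 + 0.5 * xi + rng.gauss(0, 2) for xi in x]

ess, tss, ssr = sums_of_squares(x, y)
r2_from_ess = ess / tss
r2_from_ssr = 1 - ssr / tss
r2_from_corr = correlation(x, y) ** 2

print(f"TSS = {tss:.2f}, ESS + SSR = {ess + ssr:.2f}")
print(f"R^2: ESS/TSS = {r2_from_ess:.4f}, 1 - SSR/TSS = {r2_from_ssr:.4f}, "
      f"corr^2 = {r2_from_corr:.4f}")
```

Up to floating-point rounding, the three values of the $R^2$ coincide, and TSS equals ESS + SSR.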


The Standard Error of the Regression 


The standard error of the regression (SER) is an estimator of the standard deviation 
of the regression error $u_i$. The units of $u_i$ and $Y_i$ are the same, so the SER is a measure 
of the spread of the observations around the regression line, measured in the units of 
the dependent variable. For example, if the units of the dependent variable are 
dollars, then the SER measures the magnitude of a typical deviation from the regression 
line (that is, the magnitude of a typical regression error) in dollars. 

Because the regression errors $u_1, \ldots, u_n$ are unobserved, the SER is computed 
using their sample counterparts, the OLS residuals $\hat{u}_1, \ldots, \hat{u}_n$. The formula for the 
SER is 


$$SER = s_{\hat{u}} = \sqrt{s_{\hat{u}}^2}, \quad \text{where } s_{\hat{u}}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2 = \frac{SSR}{n-2}, \tag{4.17}$$


where the formula for $s_{\hat{u}}^2$ uses the fact (proven in Appendix 4.3) that the sample 
average of the OLS residuals is 0. 

The formula for the SER in Equation (4.17) is similar to the formula for the 
sample standard deviation of $Y$ given in Equation (3.7) in Section 3.2, except that 
$Y_i - \bar{Y}$ in Equation (3.7) is replaced by $\hat{u}_i$ and the divisor in Equation (3.7) is $n - 1$, 
whereas here it is $n - 2$. The reason for using the divisor $n - 2$ here (instead of $n$) is 
the same as the reason for using the divisor $n - 1$ in Equation (3.7): It corrects for a 
slight downward bias introduced because two regression coefficients were estimated. 
This is called a “degrees of freedom” correction because when two coefficients were 
estimated ($\beta_0$ and $\beta_1$), two “degrees of freedom” of the data were lost, so the divisor 
in this factor is $n - 2$. (The mathematics behind this is discussed in Section 5.6.) 
When $n$ is large, the difference among dividing by $n$, by $n - 1$, or by $n - 2$ is 
negligible. 


Prediction Using OLS 


The predicted value $\hat{Y}_i$ for the $i$th observation is the value of $Y$ predicted by the OLS 
regression line when $X$ takes on its value $X_i$ for that observation. This is called an 
in-sample prediction because the observation for which the prediction is made was 
also used to estimate the regression coefficients. 

In practice, prediction methods are used to predict $Y$ when $X$ is known but $Y$ is 
not. Such observations are not in the data set used to estimate the coefficients. 
Prediction for observations not in the estimation sample is called out-of-sample 
prediction. 

The goal of prediction is to provide accurate out-of-sample predictions. For 
example, in the father’s prediction problem, he was interested in predicting test 
scores for a district that had not reported them, using that district’s student-teacher 
ratio. In the linear regression model with a single regressor, the predicted value for 
an out-of-sample observation that takes on the value $X$ is $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$. 

Because no prediction is perfect, a prediction should be accompanied by an 
estimate of its accuracy, that is, by an estimate of how accurate the prediction 
might reasonably be expected to be. A natural measure of that accuracy is the 
standard deviation of the out-of-sample prediction error, $Y - \hat{Y}$. Because $Y$ is not 
known, this out-of-sample standard deviation cannot be estimated directly. If, 
however, the observation being predicted is drawn from the same population as the 
data used to estimate the regression coefficients, then the standard deviation of the 
out-of-sample prediction error can be estimated using the sample standard 
deviation of the in-sample prediction error, which is the standard error of the regression. 
A common way to report a prediction and its accuracy is as the prediction ± the 
SER, that is, $\hat{Y} \pm s_{\hat{u}}$. More refined measures of prediction accuracy are 
introduced in Chapter 14. 
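
The following sketch puts the SER formula and this reporting convention together. It fits a line to simulated (hypothetical) data, computes the SER with the n − 2 divisor, and reports an out-of-sample prediction as the predicted value ± the SER. The data-generating numbers, the value `x_new`, and the function names are assumptions made for illustration.

```python
import math
import random


def ols_fit(x, y):
    """Return (intercept, slope) from the OLS formulas."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope


def standard_error_of_regression(x, y, b0, b1):
    """SER = sqrt(SSR / (n - 2)), using the degrees-of-freedom correction."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ssr = sum(u ** 2 for u in residuals)
    return math.sqrt(ssr / (len(x) - 2))


# Hypothetical estimation sample: Y = 5 + 3 X + error, error s.d. = 10.
rng = random.Random(3)
ERROR_SD = 10.0
x = [rng.uniform(0, 10) for _ in range(1000)]
y = [5 + 3 * xi + rng.gauss(0, ERROR_SD) for xi in x]

b0, b1 = ols_fit(x, y)
ser = standard_error_of_regression(x, y, b0, b1)

# Out-of-sample prediction for a new observation with a known X.
x_new = 4.0
y_pred = b0 + b1 * x_new
print(f"SER = {ser:.2f} (error s.d. used in the simulation: {ERROR_SD})")
print(f"prediction at X = {x_new}: {y_pred:.1f} ± {ser:.1f}")
```

Because the new observation is assumed to come from the same population as the estimation sample, the SER serves here as the estimate of the typical size of the out-of-sample prediction error.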


Application to the Test Score Data 


Equation (4.9) reports the regression line, estimated using the California test score 
data, relating the standardized test score (TestScore) to the student-teacher ratio 
(STR). The $R^2$ of this regression is 0.051, or 5.1%, and the SER is 18.6. 

The $R^2$ of 0.051 means that the regressor STR explains 5.1% of the variance of 
the dependent variable TestScore. Figure 4.3 superimposes the sample regression line 
on the scatterplot of the TestScore and STR data. As the scatterplot shows, the 
student-teacher ratio explains some of the variation in test scores, but much variation remains 
unaccounted for. 

The SER of 18.6 means that the standard deviation of the regression residuals is 
18.6, where the units are points on the standardized test. Because the standard 
deviation is a measure of spread, the SER of 18.6 means that there is a large spread of the 
scatterplot in Figure 4.3 around the regression line as measured in points on the test. 
This large spread means that predictions of test scores made using only the 
student-teacher ratio for that district will often be wrong by a large amount. 

What should we make of this low $R^2$ and large SER? The fact that the $R^2$ of this 
regression is low (and the SER is large) does not, by itself, imply that this regression 
is either “good” or “bad.” What the low $R^2$ does tell us is that other important factors 
influence test scores. These factors could include differences in the student body 
across districts, differences in school quality unrelated to the student-teacher ratio, 
or luck on the test. The low $R^2$ and high SER do not tell us what these factors are, but 
they do indicate that the student-teacher ratio alone explains only a small part of the 
variation in test scores in these data. 


4.4 The Least Squares Assumptions for Causal Inference 


In the test score example, the sample regression line, estimated using California district- 
level data, provides an answer to the father’s problem of predicting the test score in 
a district when he knows its student-teacher ratio but not its test score. 

The superintendent, however, is not interested in predicting test scores: She 
wants to improve them in her district. For that purpose, she needs to know the causal 
effect on test scores if she were to reduce the student-teacher ratio. Said differently, 
the superintendent has in mind a very particular definition of $\beta_1$: the causal effect on 
test scores of an intervention that changes the student-teacher ratio. 

When $\beta_1$ is defined to be the causal effect, whether it is well estimated by OLS 
depends on the nature of the data. As discussed in Section 3.5, the difference in 
means between the treatment and control groups in an ideal randomized experiment 
is an unbiased estimator of the causal effect of a binary treatment; that is, if $X$ is 
randomly assigned, the causal effect of the treatment is $E(Y | X = 1) - E(Y | X = 0)$. 
The difference in means is a workhorse statistical tool that can be used for many 
purposes; when $X$ is randomly assigned, it provides an unbiased estimate of the 
causal effect of a binary treatment. This logic extends to the linear regression model 
and the least squares estimator. 

In this section, we define $\beta_1$ to be the causal effect of a unit change in $X$. Because 
$X$ can take on multiple values, the causal effect of a given change in $X$, $\Delta X$, is $\beta_1 \Delta X$, 
where the Greek letter $\Delta$ (delta) stands for “change in.” This definition of the 
coefficient on the variable of interest (for example, STR) as its causal effect is maintained 
through Chapter 13. 

This section lays out three mathematical assumptions under which OLS estimates 
the causal effect. The first assumption translates the idea that $X$ is randomly assigned, 
or as-if randomly assigned, into the language of linear regression. The other two 
assumptions are technical ones under which the sampling distributions of the OLS 
estimators can be approximated by a normal distribution in large samples. These latter 
two assumptions are extensions of the two assumptions underlying the weak law of 
large numbers (Key Concept 2.6) and central limit theorem (Key Concept 2.7) 
for the sample mean $\bar{Y}$: that the data are i.i.d. and that outliers are unlikely. 


Assumption 1: The Conditional Distribution of $u_i$ 
Given $X_i$ Has a Mean of Zero 


The first least squares assumption translates into the language of regression analysis 
the requirement that, for estimation of the causal effect, $X$ must be randomly assigned 
or as-if randomly assigned. To make this translation, we first need to be more specific 
about what the error term $u_i$ is. 

In the test score example, class size is just one of many facets of elementary 
education. One district might have better teachers, or it might use better textbooks. 
Two districts with comparable class sizes, teachers, and textbooks still might have very 
different student populations; perhaps one district has more immigrants (and thus 
fewer native English speakers) or wealthier families. Finally, even if two districts are 
the same in all these ways, they might have different test scores for essentially random 
reasons having to do with the performance of the individual students on the day of the 
test or errors in recording their scores. The error term in the class size regression rep- 
resents the contribution to test scores made by all these other, omitted factors. 

The first least squares assumption is that the conditional distribution of $u_i$ given 
$X_i$ has a mean of 0. This assumption is a formal mathematical statement about the 
other factors contained in $u_i$ and asserts that these other factors are unrelated to $X_i$ 
in the sense that, given a value of $X_i$, the mean of the distribution of these other 
factors is 0. 


The conditional mean of u in a randomized controlled experiment. In a 
randomized controlled experiment with binary treatment, subjects are randomly assigned to 
the treatment group ($X = 1$) or to the control group ($X = 0$). When random 
assignment is done using a computer program that uses no information about the 
subject, $X$ is distributed independently of the subject’s personal characteristics, 
including those that determine $Y$. Because of random assignment, the conditional 
mean of $u$ given $X$ is 0. Because regression analysis models the conditional mean, $X$ 
does not need to be distributed independently of all the other factors comprising $u$. 
However, the mean of $u$ cannot be related to $X$; that is, $E(u_i | X_i) = 0$. 

In observational data, $X$ is not randomly assigned in an experiment. Instead, the 
best that can be hoped for is that $X$ is as if randomly assigned, in the precise sense 
that $E(u_i | X_i) = 0$. Whether this assumption holds in a given empirical application 
with observational data requires careful thought and judgment, and we return to this 
issue repeatedly. 


Correlation and conditional mean. Recall from Section 2.3 that if the conditional 
mean of one random variable given another is 0, then the two random variables have 0 
covariance and thus are uncorrelated [Equation (2.28)]. Thus the conditional mean 
assumption $E(u_i | X_i) = 0$ implies that $X_i$ and $u_i$ are uncorrelated, or $corr(X_i, u_i) = 0$. 
Because correlation is a measure of linear association, this implication does not go 
the other way; even if $X_i$ and $u_i$ are uncorrelated, the conditional mean of $u_i$ given $X_i$ 
might be nonzero (see Figure 3.3). However, if $X_i$ and $u_i$ are correlated, then it must 
be the case that $E(u_i | X_i)$ is nonzero. It is therefore often convenient to discuss the 
conditional mean assumption in terms of possible correlation between $X_i$ and $u_i$. If 
$X_i$ and $u_i$ are correlated, then the conditional mean assumption is violated. 
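
A small exact calculation illustrates why zero correlation does not imply a zero conditional mean. The population below is hypothetical and chosen only for illustration: X takes the values −1, 0, and 1 with equal probability, and u = X² − 2/3. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Hypothetical population: X is -1, 0, or 1 with equal probability,
# and the "error" is u = X^2 - 2/3, a deterministic function of X.
support = [-1, 0, 1]
prob = Fraction(1, 3)


def u_of(x):
    """Value of u when X = x in this hypothetical population."""
    return Fraction(x * x) - Fraction(2, 3)


mean_x = sum(prob * x for x in support)
mean_u = sum(prob * u_of(x) for x in support)
cov_xu = sum(prob * (x - mean_x) * (u_of(x) - mean_u) for x in support)

# Because u is a function of X here, E(u | X = x) is simply u_of(x).
conditional_means = {x: u_of(x) for x in support}

print("E(u) =", mean_u)
print("cov(X, u) =", cov_xu)
for x, m in conditional_means.items():
    print(f"E(u | X = {x}) = {m}")
```

In this example the covariance between X and u is exactly zero, yet the conditional mean of u is −2/3 when X = 0 and 1/3 when X = ±1, so the conditional mean assumption fails even though X and u are uncorrelated.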


Assumption 2: $(X_i, Y_i)$, $i = 1, \ldots, n$, Are Independently 
and Identically Distributed 


The second least squares assumption is that $(X_i, Y_i)$, $i = 1, \ldots, n$, are independently 
and identically distributed (i.i.d.) across observations. As discussed in Section 2.5 
(Key Concept 2.5), this assumption is a statement about how the sample is drawn. If 
the observations are drawn by simple random sampling from a single large 
population, then $(X_i, Y_i)$, $i = 1, \ldots, n$, are i.i.d. For example, let $X$ be the age of a worker 
and $Y$ be his or her earnings, and imagine drawing a person at random from the 
population of workers. That randomly drawn person will have a certain age and 
earnings (that is, $X$ and $Y$ will take on some values). If a sample of $n$ workers is drawn 
from this population, then $(X_i, Y_i)$, $i = 1, \ldots, n$, necessarily have the same 
distribution. If they are drawn at random, they are also distributed independently from one 
observation to the next; that is, they are i.i.d. 

The i.i.d. assumption is a reasonable one for many data collection schemes. For 
example, survey data from a randomly chosen subset of the population typically can 
be treated as i.i.d. 

Not all sampling schemes produce i.i.d. observations on $(X_i, Y_i)$. One example is 
when the values of $X$ are not drawn from a random sample of the population but 
rather are set by a researcher as part of an experiment. For example, suppose a 
horticulturalist wants to study the effects of different organic weeding methods ($X$) on 
tomato production ($Y$) and accordingly grows different plots of tomatoes using 
different organic weeding techniques. If she picks the technique (the level of $X$) to be 
used on the $i$th plot and applies the same technique to the $i$th plot in all repetitions of 
the experiment, then the value of $X_i$ does not change from one sample to the next. 
Said differently, $X$ is fixed in repeated experiments, that is, in repeated draws of the 
sample. Thus $X_i$ is nonrandom (although the outcome $Y_i$ is random), so the sampling 
scheme is not i.i.d. The results presented in this chapter developed for i.i.d. regressors 
are also true if the regressors are nonrandom. The case of a nonrandom regressor is, 
however, quite special. For example, modern experimental protocols would have the 
horticulturalist assign the level of $X$ to the different plots using a computerized 
random number generator, thereby circumventing any possible bias by the 
horticulturalist (she might use her favorite weeding method for the tomatoes in the sunniest plot). 
When this modern experimental protocol is used, the level of $X$ is random, and 
$(X_i, Y_i)$ are i.i.d. 

Another example of non-i.i.d. sampling is when observations refer to the same 
unit of observation over time. For example, we might have data on inventory levels 
($Y$) at a firm and the interest rate at which the firm can borrow ($X$), where these data 
are collected over time from a specific firm; for example, they might be recorded four 
times a year (quarterly) for 30 years. This is an example of time series data, and a key 
feature of time series data is that observations falling close to each other in time are 
not independent but rather tend to be correlated with each other: If interest rates are 
low now, they are likely to be low next quarter. This pattern of correlation violates 
the “independence” part of the i.i.d. assumption. Time series data introduce a set of 
complications that are best handled after developing the basic tools of regression 
analysis, so we postpone discussion of time series data until Chapter 15. 
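
To see what this kind of dependence looks like, the sketch below simulates a persistent series, in which each value equals 0.9 times the previous value plus a random shock, and compares the correlation between adjacent observations with that of an i.i.d. series. The persistence value 0.9, the sample length, and the function names are assumptions chosen for illustration; the simulated series is not interest rate data.

```python
import math
import random


def sample_correlation(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b))
    s_aa = sum((u - a_bar) ** 2 for u in a)
    s_bb = sum((v - b_bar) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def simulate_persistent_series(n, rho, rng):
    """Simulate z_t = rho * z_{t-1} + e_t with standard normal shocks e_t."""
    z = [0.0]
    for _ in range(n - 1):
        z.append(rho * z[-1] + rng.gauss(0, 1))
    return z


rng = random.Random(4)
n = 2000  # longer than 30 years of quarterly data, to keep the estimates stable
persistent = simulate_persistent_series(n, 0.9, rng)
independent = [rng.gauss(0, 1) for _ in range(n)]  # i.i.d. benchmark

# Correlation between each observation and the one that follows it.
corr_persistent = sample_correlation(persistent[:-1], persistent[1:])
corr_independent = sample_correlation(independent[:-1], independent[1:])

print(f"adjacent-observation correlation, persistent series: {corr_persistent:.2f}")
print(f"adjacent-observation correlation, i.i.d. series:     {corr_independent:.2f}")
```

Adjacent observations are strongly correlated in the persistent series and nearly uncorrelated in the i.i.d. series, which is the sense in which time series data violate the independence part of the assumption.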


Assumption 3: Large Outliers Are Unlikely 


The third least squares assumption is that large outliers (that is, observations with 
values of $X_i$, $Y_i$, or both that are far outside the usual range of the data) are unlikely. 
Large outliers can make OLS regression results misleading. This potential sensitivity 
of OLS to extreme outliers is illustrated in Figure 4.4 using hypothetical data. 
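
The same sensitivity can be demonstrated with a short simulation. In the sketch below, X and Y are generated independently, so the OLS slope should be near zero; adding a single extreme observation then changes the estimated slope substantially. The data are hypothetical (they are not the data behind Figure 4.4), and the location of the outlier is an arbitrary choice.

```python
import random


def ols_fit(x, y):
    """Return (intercept, slope) from the OLS formulas."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope


rng = random.Random(5)
# Hypothetical data in which X and Y are unrelated.
x = [rng.gauss(0, 1) for _ in range(50)]
y = [rng.gauss(0, 1) for _ in range(50)]
_, slope_clean = ols_fit(x, y)

# Add one extreme observation, such as might arise from a data entry error.
x_with_outlier = x + [20.0]
y_with_outlier = y + [20.0]
_, slope_outlier = ols_fit(x_with_outlier, y_with_outlier)

print(f"slope without the outlier: {slope_clean:.2f}")
print(f"slope with one outlier:    {slope_outlier:.2f}")
```

A single observation out of 51 is enough to pull the fitted line toward itself, because squared deviations give extreme points a very large weight.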

In this book, the assumption that large outliers are unlikely is made 
mathematically precise by assuming that $X$ and $Y$ have nonzero finite fourth moments: 
$0 < E(X_i^4) < \infty$ and $0 < E(Y_i^4) < \infty$. Another way to state this assumption is 
that $X$ and $Y$ have finite kurtosis. 

The assumption of finite kurtosis is used in the mathematics that justify the 
large-sample approximations to the distributions of the OLS test statistics. For example, we 
encountered this assumption in Chapter 3 when discussing the consistency of the 
sample variance. Specifically, Equation (3.9) states that the sample variance is a consistent 
estimator of the population variance $\sigma_Y^2$ ($s_Y^2 \xrightarrow{p} \sigma_Y^2$). If $Y_1, \ldots, Y_n$ are i.i.d. and the 
fourth moment of $Y_i$ is finite, then the law of large numbers in Key Concept 2.6 
applies to the average, $\frac{1}{n}\sum_{i=1}^{n} Y_i^2$, a key step in the proof in Appendix 3.3 showing that 
$s_Y^2$ is consistent. 

One source of large outliers is data entry errors, such as a typographical error or 
incorrectly using different units for different observations. Imagine collecting data on 
the height of students in meters but inadvertently recording one student’s height in 
centimeters instead. This would create a large outlier in the sample. One way to find 
outliers is to plot your data. If you decide that an outlier is due to a data entry error, 
then you can either correct the error or, if that is impossible, drop the observation 
from your data set. 

Data entry errors aside, the assumption of finite kurtosis is a plausible one in 
many applications with economic data. Class size is capped by the physical capacity 
of a classroom; the best you can do on a standardized test is to get all the questions 
right, and the worst you can do is to get all the questions wrong. Because class size 
and test scores have a finite range, they necessarily have finite kurtosis. More 
generally, commonly used distributions such as the normal distribution have finite fourth moments. 
Still, as a mathematical matter, some distributions have infinite fourth moments, and 
this assumption rules out those distributions. If the assumption of finite fourth 
moments holds, then it is unlikely that statistical inferences using OLS will be domi- 
nated by a few observations. 


Use of the Least Squares Assumptions 


The three least squares assumptions for the linear regression model are summarized 
in Key Concept 4.3. The least squares assumptions play twin roles, and we return to 
them repeatedly throughout this text. 

Their first role is mathematical: If these assumptions hold, then, as is shown in 
the next section, in large samples the OLS estimators are consistent and have sam- 
pling distributions that are normal. This large-sample normal distribution underpins 
methods for testing hypotheses and constructing confidence intervals using the OLS 
estimators. 


Key Concept 4.3: The Least Squares Assumptions for Causal Inference 


$$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \ldots, n,$$

where $\beta_1$ is the causal effect on $Y$ of $X$, and: 


1. The error term $u_i$ has conditional mean 0 given $X_i$: $E(u_i | X_i) = 0$; 
2. $(X_i, Y_i)$, $i = 1, \ldots, n$, are independent and identically distributed (i.i.d.) 
draws from their joint distribution; and 
3. Large outliers are unlikely: $X_i$ and $Y_i$ have nonzero finite fourth moments. 


Their second role is to organize the circumstances that pose difficulties for OLS 
estimation of the causal effect $\beta_1$. As we will see, the first least squares assumption is 
the most important to consider in practice. One reason why the first least squares 
assumption might not hold in practice is discussed in Chapter 6, and additional 
reasons are discussed in Section 9.2. 

It is also important to consider whether the second assumption holds in an appli- 
cation. Although it plausibly holds in many cross-sectional data sets, the indepen- 
dence assumption is inappropriate for panel and time series data. In those settings, 
some of the regression methods developed under assumption 2 require modifica- 
tions. Those modifications are developed in Chapters 10 and 15-17. 

The third assumption serves as a reminder that OLS, just like the sample mean, can be 
sensitive to large outliers. If your data set contains outliers, you should examine them care- 
fully to make sure those observations are correctly recorded and belong in the data set. 
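The sensitivity of OLS to a single large outlier can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the population line (intercept 1, slope 2), the sample size, and the mis-recorded observation are all hypothetical and do not come from the text.

```python
import random


def ols(x, y):
    """Return (intercept, slope) computed from the usual OLS formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


random.seed(0)  # arbitrary seed, for reproducibility only
n = 50
x = [random.uniform(0.0, 10.0) for _ in range(n)]
# Hypothetical population line: Y = 1 + 2 X + u, with u ~ N(0, 1).
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

_, slope_clean = ols(x, y)

# Add one mis-recorded observation: the Y value is inflated by a factor of 100,
# as might happen with a data-entry error.
x_bad = x + [10.0]
y_bad = y + [100.0 * (1.0 + 2.0 * 10.0)]
_, slope_outlier = ols(x_bad, y_bad)

print(f"slope without the outlier: {slope_clean:.3f}")
print(f"slope with one outlier:    {slope_outlier:.3f}")
```

With these assumed values, one bad observation out of 51 is enough to move the estimated slope far from the slope of the line that generated the other 50 points, which is why suspicious observations deserve a careful look.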

The assumptions in Key Concept 4.3 apply when the aim is to estimate the causal
effect, that is, when $\beta_1$ is the causal effect. Appendix 4.4 lays out a parallel set of
least squares assumptions for prediction and discusses their relation to the
assumptions in Key Concept 4.3.


The Sampling Distribution of the OLS 
Estimators 


Because the OLS estimators $\hat\beta_0$ and $\hat\beta_1$ are computed from a randomly drawn sample,
the estimators themselves are random variables with a probability distribution (the
sampling distribution) that describes the values they could take over different possible
random samples. In small samples, these sampling distributions are complicated, but in
large samples, they are approximately normal because of the central limit theorem.


Review of the sampling distribution of $\bar Y$. Recall the discussion in Sections 2.5 and
2.6 about the sampling distribution of the sample average, $\bar Y$, an estimator of the
unknown population mean of $Y$, $\mu_Y$. Because $\bar Y$ is calculated using a randomly drawn
sample, $\bar Y$ is a random variable that takes on different values from one sample to the
next; the probability of these different values is summarized in its sampling
distribution. Although the sampling distribution of $\bar Y$ can be complicated when the sample
size is small, it is possible to make certain statements about it that hold for all $n$. In
particular, the mean of the sampling distribution is $\mu_Y$, that is, $E(\bar Y) = \mu_Y$, so $\bar Y$ is an
unbiased estimator of $\mu_Y$. If $n$ is large, then more can be said about the sampling
distribution. In particular, the central limit theorem (Section 2.6) states that this
distribution is approximately normal.


The sampling distribution of $\hat\beta_0$ and $\hat\beta_1$. These ideas carry over to the OLS
estimators $\hat\beta_0$ and $\hat\beta_1$ of the unknown intercept $\beta_0$ and slope $\beta_1$ of the population regression
line. Because the OLS estimators are calculated using a random sample, $\hat\beta_0$ and $\hat\beta_1$ are
random variables that take on different values from one sample to the next; the
probability of these different values is summarized in their sampling distributions.

Although the sampling distribution of $\hat\beta_0$ and $\hat\beta_1$ can be complicated when the
sample size is small, it is possible to make certain statements about it that hold for all
$n$. In particular, the means of the sampling distributions of $\hat\beta_0$ and $\hat\beta_1$ are $\beta_0$ and $\beta_1$. In
other words, under the least squares assumptions in Key Concept 4.3,

$$E(\hat\beta_0) = \beta_0 \quad\text{and}\quad E(\hat\beta_1) = \beta_1; \tag{4.18}$$

that is, $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators of $\beta_0$ and $\beta_1$. The proof that $\hat\beta_1$ is unbiased
is given in Appendix 4.3, and the proof that $\hat\beta_0$ is unbiased is left as Exercise 4.7.

If the sample is sufficiently large, by the central limit theorem the joint sampling
distribution of $\hat\beta_0$ and $\hat\beta_1$ is well approximated by the bivariate normal distribution (Section 2.4).
This implies that the marginal distributions of $\hat\beta_0$ and $\hat\beta_1$ are normal in large samples.

This argument invokes the central limit theorem. Technically, the central limit
theorem concerns the distribution of averages (like $\bar Y$). If you examine the
numerator in Equation (4.5) for $\hat\beta_1$, you will see that it, too, is a type of average: not a simple
average, like $\bar Y$, but an average of the product, $(Y_i - \bar Y)(X_i - \bar X)$. As discussed
further in Appendix 4.3, the central limit theorem applies to this average, so that, like
the simpler average $\bar Y$, it is normally distributed in large samples.
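The idea of a sampling distribution can be made concrete by drawing many samples and computing the OLS slope in each. The following is a minimal Monte Carlo sketch, not part of the text's own analysis: the population coefficients, the uniform distribution of X, the skewed error distribution, and the number of replications are all assumptions chosen for illustration.

```python
import random
import statistics


def ols_slope(x, y):
    """OLS slope: sum of cross-deviations over sum of squared X deviations."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


def simulate_slopes(n, reps, beta0, beta1, rng):
    """Draw `reps` independent samples of size n; return the OLS slope from each."""
    slopes = []
    for _ in range(reps):
        x = [rng.uniform(0.0, 10.0) for _ in range(n)]
        # Skewed, clearly non-normal errors with mean 0: exponential(1) minus 1.
        u = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
        y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
        slopes.append(ols_slope(x, y))
    return slopes


rng = random.Random(42)  # arbitrary seed, for reproducibility only
BETA0, BETA1 = 1.0, 2.0  # hypothetical population coefficients
slopes = simulate_slopes(n=100, reps=2000, beta0=BETA0, beta1=BETA1, rng=rng)

mean_slope = statistics.fmean(slopes)
sd_slope = statistics.stdev(slopes)
# If the sampling distribution is close to normal, about 95% of the
# estimates should fall within 1.96 standard deviations of their mean.
share_within = sum(abs(s - mean_slope) <= 1.96 * sd_slope for s in slopes) / len(slopes)

print(f"mean of slope estimates:      {mean_slope:.4f} (population slope {BETA1})")
print(f"std. dev. of slope estimates: {sd_slope:.4f}")
print(f"share within 1.96 std. devs.: {share_within:.3f}")
```

Even though the errors in this sketch are skewed, the average of the slope estimates should sit close to the population slope, and the share within 1.96 standard deviations should be near 0.95, in line with unbiasedness and the large-sample normal approximation.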

The normal approximation to the distribution of the OLS estimators in large
samples is summarized in Key Concept 4.4. (Appendix 4.3 summarizes the derivation
of these formulas.) A relevant question in practice is how large $n$ must be for these
approximations to be reliable. In Section 2.6, we suggested that $n = 100$ is
sufficiently large for the sampling distribution of $\bar Y$ to be well approximated by a normal
distribution, and sometimes a smaller $n$ suffices. This criterion carries over to the more
complicated averages appearing in regression analysis. In virtually all modern
econometric applications, $n > 100$, so we will treat the normal approximations to the
distributions of the OLS estimators as reliable unless there are good reasons to think
otherwise.

Key Concept 4.4: Large-Sample Distributions of $\hat\beta_0$ and $\hat\beta_1$

If the least squares assumptions in Key Concept 4.3 hold, then in large samples
$\hat\beta_0$ and $\hat\beta_1$ have a jointly normal sampling distribution. The large-sample normal
distribution of $\hat\beta_1$ is $N(\beta_1, \sigma^2_{\hat\beta_1})$, where the variance of this distribution, $\sigma^2_{\hat\beta_1}$, is

$$\sigma^2_{\hat\beta_1} = \frac{1}{n}\,\frac{\operatorname{var}[(X_i - \mu_X)u_i]}{[\operatorname{var}(X_i)]^2}. \tag{4.19}$$

The results in Key Concept 4.4 imply that the OLS estimators are consistent; that is,
when the sample size is large and the least squares assumptions hold, $\hat\beta_0$ and $\hat\beta_1$ will be
close to the true population coefficients $\beta_0$ and $\beta_1$ with high probability. This is because
the variances $\sigma^2_{\hat\beta_0}$ and $\sigma^2_{\hat\beta_1}$ of the estimators decrease to 0 as $n$ increases ($n$ appears in the
denominator of the formulas for the variances), so the distribution of the OLS estimators
will be tightly concentrated around their means, $\beta_0$ and $\beta_1$, when $n$ is large.

Another implication of the distributions in Key Concept 4.4 is that, in general,
the larger is the variance of $X_i$, the smaller is the variance $\sigma^2_{\hat\beta_1}$ of $\hat\beta_1$. Mathematically,
this implication arises because the variance of $\hat\beta_1$ in Equation (4.19) is inversely
proportional to the square of the variance of $X_i$: the larger is $\operatorname{var}(X_i)$, the larger is the
denominator in Equation (4.19), so the smaller is $\sigma^2_{\hat\beta_1}$. To get a better sense of why this
is so, look at Figure 4.5, which presents a scatterplot of 150 artificial data points on $X$
and $Y$. The data points indicated by the colored dots are the 75 observations closest
to $\bar X$. Suppose you were asked to draw a line as accurately as possible through either
the colored or the black dots. Which would you choose? It would be easier to draw
a precise line through the black dots, which have a larger variance than the colored
dots. Similarly, the larger the variance of $X$, the more precise is $\hat\beta_1$.
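The effect of the spread of X on the precision of the slope estimate can be checked by simulation. The sketch below compares a narrow and a wide range for X; the population line, the error distribution, and the two ranges are hypothetical choices made for illustration. It also assumes errors that are independent of X, in which case the numerator of the variance formula is itself proportional to var(X), so the standard deviation of the slope falls with the square root of var(X).

```python
import random
import statistics


def ols_slope(x, y):
    """OLS slope: sum of cross-deviations over sum of squared X deviations."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


def slope_sd(x_low, x_high, n, reps, rng):
    """Monte Carlo std. dev. of the OLS slope when X ~ Uniform(x_low, x_high)."""
    slopes = []
    for _ in range(reps):
        x = [rng.uniform(x_low, x_high) for _ in range(n)]
        # Hypothetical model: Y = 1 + 2 X + u, with u ~ N(0, 1) independent of X.
        y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
        slopes.append(ols_slope(x, y))
    return statistics.stdev(slopes)


rng = random.Random(7)  # arbitrary seed, for reproducibility only
sd_narrow = slope_sd(4.0, 6.0, n=150, reps=1000, rng=rng)   # var(X) = 1/3
sd_wide = slope_sd(0.0, 10.0, n=150, reps=1000, rng=rng)    # var(X) = 25/3
ratio = sd_narrow / sd_wide

print(f"std. dev. of slope, narrow X: {sd_narrow:.4f}")
print(f"std. dev. of slope, wide X:   {sd_wide:.4f}")
print(f"ratio (narrow / wide):        {ratio:.2f}")
```

In this design var(X) is 25 times larger in the wide case, so under the stated assumptions the ratio of standard deviations should come out near 5.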

The distributions in Key Concept 4.4 also imply that the smaller is the variance
of the error $u_i$, the smaller is the variance of $\hat\beta_1$. This can be seen mathematically in
Equation (4.19) because $u_i$ enters the numerator, but not the denominator, of $\sigma^2_{\hat\beta_1}$. If all
$u_i$ were smaller by a factor of one-half but the $X$'s did not change, then $\sigma_{\hat\beta_1}$ would be
smaller by a factor of one-half and $\sigma^2_{\hat\beta_1}$ would be smaller by a factor of one-fourth
(Exercise 4.13). Stated less mathematically, if the errors are smaller (holding the $X$'s
fixed), then the data will have a tighter scatter around the population regression line,
so its slope will be estimated more precisely.
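The factor of one-fourth follows in one step from the variance formula. The short derivation below is a sketch that assumes Equation (4.19) has the form restated at the end of Appendix 4.3; the symbol $\tilde u_i = u_i/2$ and the tildes on the variances are introduced here only for illustration.

```latex
\begin{aligned}
\tilde\sigma^2_{\hat\beta_1}
  &= \frac{\operatorname{var}[(X_i - \mu_X)\,\tilde u_i]}{n\,[\operatorname{var}(X_i)]^2}
   = \frac{\operatorname{var}\!\left[\tfrac12 (X_i - \mu_X)\,u_i\right]}{n\,[\operatorname{var}(X_i)]^2}
   = \frac14 \cdot \frac{\operatorname{var}[(X_i - \mu_X)\,u_i]}{n\,[\operatorname{var}(X_i)]^2}
   = \frac14\,\sigma^2_{\hat\beta_1}, \\
\tilde\sigma_{\hat\beta_1}
  &= \tfrac12\,\sigma_{\hat\beta_1}.
\end{aligned}
```

The constant 1/2 comes out of the variance squared, which is why the variance falls by one-fourth while the standard deviation falls by one-half.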

The normal approximation to the sampling distribution of $\hat\beta_0$ and $\hat\beta_1$ is a powerful
tool. With this approximation in hand, we are able to develop methods for making
inferences about the true population values of the regression coefficients using only
a sample of data.


Conclusion 


This chapter has focused on the use of ordinary least squares to estimate the
intercept and slope of a population regression line using a sample of $n$ observations on a
dependent variable, $Y$, and a single regressor, $X$. The sample regression line,
estimated by OLS, can be used to predict $Y$ given a value of $X$. When $\beta_1$ is defined to be
the causal effect on $Y$ of a unit change in $X$ and the least squares assumptions for
causal inference (Key Concept 4.3) hold, then the OLS estimators of the slope and
intercept are unbiased, are consistent, and have a sampling distribution with a
variance that is inversely proportional to the sample size $n$. Moreover, if $n$ is large, then
the sampling distribution of the OLS estimator is normal.

The first least squares assumption for causal inference is that the error term in
the linear regression model has a conditional mean of 0 given the regressor $X$. This
assumption holds if $X$ is randomly assigned in an experiment or is as-if randomly
assigned in observational data. Under this assumption, the OLS estimator is an
unbiased estimator of the causal effect $\beta_1$.

The second least squares assumption is that $(X_i, Y_i)$ are i.i.d., as is the case if the
data are collected by simple random sampling. This assumption yields the formula,
presented in Key Concept 4.4, for the variance of the sampling distribution of the
OLS estimator.

The third least squares assumption is that large outliers are unlikely. Stated more 
formally, X and Y have finite fourth moments (finite kurtosis). This assumption is 
needed because OLS can be unreliable if there are large outliers. Taken together, the 
three least squares assumptions imply that the OLS estimator is normally distributed 
in large samples as described in Key Concept 4.4. 

The results in this chapter describe the sampling distribution of the OLS estimator.
By themselves, however, these results are not sufficient to test a hypothesis about the
value of $\beta_1$ or to construct a confidence interval for $\beta_1$. Doing so requires an estimator
of the standard deviation of the sampling distribution, that is, the standard error of
the OLS estimator. This step, moving from the sampling distribution of $\hat\beta_1$ to its
standard error, hypothesis tests, and confidence intervals, is taken in the next chapter.
Summary

1. The population regression line, $\beta_0 + \beta_1 X$, is the mean of $Y$ as a function of the
   value of $X$. The slope, $\beta_1$, is the expected difference in $Y$ between two
   observations with $X$ values that differ by one unit. The intercept, $\beta_0$, determines the
   level (or height) of the regression line. Key Concept 4.1 summarizes the
   terminology of the population linear regression model.

2. The population regression line can be estimated using sample observations
   $(Y_i, X_i)$, $i = 1, \dots, n$, by ordinary least squares (OLS). The OLS estimators of
   the regression intercept and slope are denoted $\hat\beta_0$ and $\hat\beta_1$. The predicted value
   of $Y$ given $X$ is $\hat\beta_0 + \hat\beta_1 X$.

3. The $R^2$ and standard error of the regression (SER) are measures of how
   close the values of $Y_i$ are to the estimated regression line. The $R^2$ is between
   0 and 1, with a larger value indicating that the $Y_i$'s are closer to the line.
   The standard error of the regression estimates the standard deviation of the
   regression error.

4. There are three key assumptions for estimating causal effects using the linear
   regression model: (1) The regression errors, $u_i$, have a mean of 0, conditional
   on the regressors $X_i$; (2) the sample observations are i.i.d. random draws from
   the population; and (3) large outliers are unlikely. If these assumptions hold,
   the OLS estimator $\hat\beta_1$ is (1) an unbiased estimator of the causal effect $\beta_1$, (2)
   consistent, and (3) normally distributed when the sample is large.
Review the Concepts


4.1 What is a linear regression model? What is measured by the coefficients of a
linear regression model, the intercept $\beta_0$ and the slope $\beta_1$? What is the ordinary least
squares estimator?


4.2 Explain what is meant by the error term. What assumptions do we make about
the error term when estimating an OLS regression?


4.3 What is meant by the assumption that the paired sample observations of $Y_i$
and $X_i$ are independently and identically distributed? Why is this an
important assumption for OLS estimation? When is this assumption likely to be
violated?


4.4 Distinguish between $R^2$ and SER. How does each of these measures describe
the fit of a regression?


Exercises 


4.1 Suppose that a researcher, using data on class size (CS) and average test scores 
from 50 third-grade classes, estimates the OLS regression: 


$\widehat{TestScore} = 640.3 - 4.93 \times CS$, $R^2 = 0.11$, SER = 8.7.


a. A classroom has 25 students. What is the regression’s prediction for that 
classroom’s average test score? 


b. Last year a classroom had 21 students, and this year it has 24 students. 
What is the regression’s prediction for the change in the classroom average 
test score? 


c. The sample average class size across the 50 classrooms is 22.8. What is 
the sample average of the test scores across the 50 classrooms? (Hint: 
Review the formulas for the OLS estimators.) 


d. What is the sample standard deviation of test scores across the 50
classrooms? (Hint: Review the formulas for the $R^2$ and SER.)


4.2 A random sample of 100 20-year-old men is selected from a population and these
men's height and weight are recorded. A regression of weight on height yields

$\widehat{Weight} = -79.24 + 4.16 \times Height$, $R^2 = 0.72$, SER = 12.6,

where Weight is measured in pounds and Height is measured in inches.


a. What is the regression's weight prediction for someone who is 64 inches
tall? 68 inches tall? 72 inches tall?


b. A man has a late growth spurt and grows 2 inches over the course of a year.
What is the regression's prediction for the increase in this man's weight?


c. Suppose that instead of measuring weight and height in pounds and
inches, these variables are measured in kilograms and centimeters. What
are the regression estimates from this new centimeter-kilogram
regression? (Give all results: estimated coefficients, $R^2$, and SER.)


4.3 A regression of average monthly expenditure (AME, measured in dollars) on
average monthly income (AMI, measured in dollars) using a random sample of
college-educated full-time workers earning €100 to €1.5 million yields the following:

$\widehat{AME} = 710.7 + 8.8 \times AMI$, $R^2 = 0.030$, SER = 540.30


a. Explain what the coefficient values 710.7 and 8.8 mean.


b. The standard error of the regression (SER) is 540.30. What are the units
of measurement for the SER? (Euros? Or is it unit free?)


c. The regression $R^2$ is 0.030. What are the units of measurement for the $R^2$?
(Euros? Or is $R^2$ unit free?)


d. What does the regression predict will be the expenditure of a person
with an income of €100? With an income of €200?


e. Will the regression give reliable predictions for a person with an income
of €2 million? Why or why not?


f. Given what you know about the distribution of earnings, do you think it is
plausible that the distribution of errors in the regression is normal? (Hint:
Do you think that the distribution is symmetric or skewed? What is the
smallest value of earnings, and is it consistent with a normal distribution?)


4.4 Your class is asked to investigate the effect of average temperature on
average weekly earnings (AWE, measured in dollars) across countries, using the
following general regression approach:

$\widehat{AWE} = \hat\beta_0 + \hat\beta_1 \times temperature$

One of your classmates, Rachel, is an American and decides to analyze the
effect of temperature measured in Fahrenheit, while most of the other
students analyze the effect of temperature measured in Celsius.


$X_F = 32 + \tfrac{9}{5} \times X_C$

If everything else is the same in Rachel's analysis compared to the other
students' analysis, then how will the following quantities differ?


a. $\hat\beta_0$ (Hint: Review Key Concept 2.3)

b. $\hat\beta_1$

c. $R^2$ (Hint: $R^2$ is equal to the square of the correlation coefficient,
which can be obtained using Equation 2.26)


4.5 A researcher runs an experiment to measure the impact of a short nap on
memory. There are 200 participants and they can take a short nap of either
60 minutes or 75 minutes. After waking up, each participant takes a short
test for short-term recall. Each participant is randomly assigned one of the
nap lengths, based on the flip of a coin. Let $Y_i$ denote the number of
points scored on the test by the $i$th participant ($0 \le Y_i \le 100$), let $X_i$ denote
the amount of time for which the participant slept prior to taking the test
($X_i = 60$ or 75), and consider the regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$.


a. Explain what the term $u_i$ represents. Why will different participants have
different values of $u_i$?

b. What is $E(u_i \mid X_i)$? Are the estimated coefficients unbiased?

c. What concerns might the researcher have about ensuring compliance
among participants?

d. The estimated regression is $\hat Y_i = 55 + 0.17 X_i$.

   i. Compute the estimated regression's prediction for the average
   score of participants who slept for 60 minutes before taking the test.
   Repeat for 75 minutes and 90 minutes.

   ii. Compute the estimated gain in score for a participant who is given an
   additional 5 minutes to nap.


4.6 Show that the first least squares assumption, $E(u_i \mid X_i) = 0$, implies that
$E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i$.


4.7 Show that $\hat\beta_0$ is an unbiased estimator of $\beta_0$. (Hint: Use the fact that $\hat\beta_1$ is unbiased,
which is shown in Appendix 4.3.)


4.8 Suppose all of the regression assumptions in Key Concept 4.3 are satisfied
except that the first assumption is replaced with $E(u_i \mid X_i) = 2$. Which parts
of Key Concept 4.4 continue to hold? Which change? Why? (Is $\hat\beta_1$ normally
distributed in large samples with mean and variance given in Key Concept 4.4?
What about $\hat\beta_0$?)


4.9 a. A linear regression yields $\hat\beta_1 = 0$. Show that $R^2 = 0$.


b. A linear regression yields $R^2 = 0$. Does this imply that $\hat\beta_1 = 0$?

4.10 Suppose $Y_i = \beta_0 + \beta_1 X_i + u_i$, where $(X_i, u_i)$ are i.i.d. and $X_i$ is a Bernoulli
random variable with $\Pr(X = 1) = 0.30$. When $X = 1$, $u_i$ is $N(0, 3)$; when
$X = 0$, $u_i$ is $N(0, 2)$.


a. Show that the regression assumptions in Key Concept 4.3 are satisfied.


b. Derive an expression for the large-sample variance of $\hat\beta_1$. [Hint: Evaluate the
terms in Equation (4.19).]


4.11 Consider the regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$.


a. Suppose you know that $\beta_0 = 0$. Derive a formula for the least squares
estimator of $\beta_1$.


b. Suppose you know that $\beta_0 = 4$. Derive a formula for the least squares
estimator of $\beta_1$.


4.12 a. Show that the regression $R^2$ in the regression of $Y$ on $X$ is the squared value
of the sample correlation between $X$ and $Y$. That is, show that $R^2 = r_{XY}^2$.


b. Show that the $R^2$ from the regression of $Y$ on $X$ is the same as the $R^2$ from
the regression of $X$ on $Y$.


c. Show that $\hat\beta_1 = r_{XY}(s_Y / s_X)$, where $r_{XY}$ is the sample correlation between
$X$ and $Y$ and $s_X$ and $s_Y$ are the sample standard deviations of $X$ and $Y$.


4.13 Suppose $Y_i = \beta_0 + \beta_1 X_i + \kappa u_i$, where $\kappa$ is a nonzero constant and $(Y_i, X_i)$
satisfy the three least squares assumptions. Show that the large-sample variance
of $\hat\beta_1$ is given by

$$\sigma^2_{\hat\beta_1} = \kappa^2\,\frac{1}{n}\,\frac{\operatorname{var}[(X_i - \mu_X)u_i]}{[\operatorname{var}(X_i)]^2}.$$

[Hint: This equation is the variance given in Equation (4.19) multiplied by $\kappa^2$.]


4.14 Show that the sample regression line passes through the point $(\bar X, \bar Y)$.


4.15 (Requires Appendix 4.4) A sample $(X_i, Y_i)$, $i = 1, \dots, n$, is collected from a
population with $E(Y \mid X) = \beta_0 + \beta_1 X$ and used to compute the least squares
estimators $\hat\beta_0$ and $\hat\beta_1$. You are interested in predicting the value of $Y^{oos}$ from a
randomly chosen out-of-sample observation with $X^{oos} = x^{oos}$.


a. Suppose the out-of-sample observation is from the same population as
the in-sample observations $(X_i, Y_i)$ and is chosen independently of the
in-sample observations.

   i. Explain why $E(Y^{oos} \mid X^{oos} = x^{oos}) = \beta_0 + \beta_1 x^{oos}$.

   ii. Let $\hat Y^{oos} = \hat\beta_0 + \hat\beta_1 x^{oos}$. Show that
   $E(\hat Y^{oos} \mid X^{oos} = x^{oos}) = \beta_0 + \beta_1 x^{oos}$.

   iii. Let $u^{oos} = Y^{oos} - (\beta_0 + \beta_1 X^{oos})$ and $\hat u^{oos} = Y^{oos} - (\hat\beta_0 + \hat\beta_1 X^{oos})$.
   Show that $\operatorname{var}(\hat u^{oos}) = \operatorname{var}(u^{oos}) + \operatorname{var}(\hat\beta_0 + \hat\beta_1 X^{oos})$.


b. Suppose the out-of-sample observation is drawn from a different
population than the in-sample population and that the joint distributions
of $X$ and $Y$ differ for the two populations. Continue to let $\beta_0$ and $\beta_1$
be the coefficients of the population regression line for the in-sample
population.

   i. Does $E(Y^{oos} \mid X^{oos} = x^{oos}) = \beta_0 + \beta_1 x^{oos}$?

   ii. Does $E(\hat Y^{oos} \mid X^{oos} = x^{oos}) = \beta_0 + \beta_1 x^{oos}$?

Empirical Exercises 


E4.1 On the text website, http://www.pearsonglobaleditions.com, you will find the
data file Growth, which contains data on average growth rates from 1960
through 1995 for 65 countries, along with variables that are potentially related
to growth.¹ A detailed description is given in Growth_Description, also
available on the website. In this exercise, you will investigate the relationship
between growth and trade.


a. Construct a scatterplot of average annual growth rate (Growth) on the
average trade share (TradeShare). Does there appear to be a relationship
between the variables?


b. One country, Malta, has a trade share much larger than the other
countries. Find Malta on the scatterplot. Does Malta look like an outlier?


c. Using all observations, run a regression of Growth on TradeShare. What
is the estimated slope? What is the estimated intercept? Use the
regression to predict the growth rate for a country with a trade share of 0.5 and
for another with a trade share equal to 1.0.


d. Estimate the same regression, excluding the data from Malta. Answer
the same questions in (c).


e. Plot the estimated regression functions from (c) and (d). Using the
scatterplot in (a), explain why the regression function that includes Malta is
steeper than the regression function that excludes Malta.


f. Where is Malta? Why is the Malta trade share so large? Should Malta be
included or excluded from the analysis?


E4.2 On the text website, http://www.pearsonglobaleditions.com, you will
find the data file Earnings_and_Height, which contains data on earnings,
height, and other characteristics of a random sample of U.S. workers.²
A detailed description is given in Earnings_and_Height_Description, also
available on the website. In this exercise, you will investigate the relationship
between earnings and height.


a. What is the median value of height in the sample?


b. i. Estimate average earnings for workers whose height is at most
   67 inches.

   ii. Estimate average earnings for workers whose height is greater than
   67 inches.

   iii. On average, do taller workers earn more than shorter workers? How
   much more? What is a 95% confidence interval for the difference in
   average earnings?


c. Construct a scatterplot of annual earnings (Earnings) on height (Height).
Notice that the points on the plot fall along horizontal lines. (There are
only 23 distinct values of Earnings.) Why? (Hint: Carefully read the
detailed data description.)


d. Run a regression of Earnings on Height.

   i. What is the estimated slope?

   ii. Use the estimated regression to predict earnings for a worker who
   is 67 inches tall, for a worker who is 70 inches tall, and for a worker
   who is 65 inches tall.


e. Suppose height were measured in centimeters instead of inches. Answer
the following questions about the Earnings on Height (in cm) regression.

   i. What is the estimated slope of the regression?
   ii. What is the estimated intercept?
   iii. What is the $R^2$?
   iv. What is the standard error of the regression?


f. Run a regression of Earnings on Height, using data for female workers
only.

   i. What is the estimated slope?

   ii. A randomly selected woman is 1 inch taller than the average
   woman in the sample. Would you predict her earnings to be higher
   or lower than the average earnings for women in the sample? By
   how much?


g. Repeat (f) for male workers.


h. Do you think that height is uncorrelated with other factors that cause
earnings? That is, do you think that the regression error term, $u_i$, has a
conditional mean of 0 given Height ($X_i$)? (You will investigate this more
in the Earnings and Height exercises in later chapters.)


¹ These data were provided by Professor Ross Levine of the University of California at Berkeley and were
used in his paper with Thorsten Beck and Norman Loayza, “Finance and the Sources of Growth,” Journal
of Financial Economics, 2000, 58: 261-300.

² These data were provided by Professors Anne Case (Princeton University) and Christina Paxson (Brown
University) and were used in their paper “Stature and Status: Height, Ability, and Labor Market
Outcomes,” Journal of Political Economy, 2008, 116(3): 499-532.


The California Test Score Data Set


The California Standardized Testing and Reporting data set contains data on test performance, 
school characteristics, and student demographic backgrounds. The data used here are from all 
420 K-6 and K-8 districts in California with data available for 1999. Test scores are the average 
of the reading and math scores on the Stanford 9 Achievement Test, a standardized test admin- 
istered to fifth-grade students. School characteristics (averaged across the district) include 
enrollment, number of teachers (measured as “full-time equivalents”), number of computers 
per classroom, and expenditures per student. The student-teacher ratio used here is the num- 
ber of students in the district divided by the number of full-time equivalent teachers. Demo- 
graphic variables for the students also are averaged across the district. The demographic variables 
include the percentage of students who are in the public assistance program CalWorks (formerly 
AFDC), the percentage of students who qualify for a reduced-price lunch, and the percent- 
age of students who are English learners (that is, students for whom English is a second 
language). All of these data were obtained from the California Department of Education
(www.cde.ca.gov).


Derivation of the OLS Estimators 


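This appendix uses calculus to derive the formulas for the OLS estimators given in Key
Concept 4.2. To minimize the sum of squared prediction mistakes $\sum_{i=1}^n (Y_i - b_0 - b_1 X_i)^2$
[Equation (4.4)], first take the partial derivatives with respect to $b_0$ and $b_1$:

$$\frac{\partial}{\partial b_0} \sum_{i=1}^n (Y_i - b_0 - b_1 X_i)^2 = -2 \sum_{i=1}^n (Y_i - b_0 - b_1 X_i) \quad\text{and} \tag{4.21}$$

$$\frac{\partial}{\partial b_1} \sum_{i=1}^n (Y_i - b_0 - b_1 X_i)^2 = -2 \sum_{i=1}^n (Y_i - b_0 - b_1 X_i) X_i. \tag{4.22}$$

The OLS estimators, $\hat\beta_0$ and $\hat\beta_1$, are the values of $b_0$ and $b_1$ that minimize $\sum_{i=1}^n (Y_i - b_0 - b_1 X_i)^2$
or, equivalently, the values of $b_0$ and $b_1$ for which the derivatives in Equations (4.21) and (4.22)
equal 0. Accordingly, setting these derivatives equal to 0, collecting terms, and dividing by $n$
shows that the OLS estimators, $\hat\beta_0$ and $\hat\beta_1$, must satisfy the two equations

$$\bar Y - \hat\beta_0 - \hat\beta_1 \bar X = 0 \quad\text{and} \tag{4.23}$$

$$\frac{1}{n} \sum_{i=1}^n X_i Y_i - \hat\beta_0 \bar X - \hat\beta_1 \frac{1}{n} \sum_{i=1}^n X_i^2 = 0. \tag{4.24}$$

Solving this pair of equations for $\hat\beta_0$ and $\hat\beta_1$ yields

$$\hat\beta_1 = \frac{\frac{1}{n} \sum_{i=1}^n X_i Y_i - \bar X \bar Y}{\frac{1}{n} \sum_{i=1}^n X_i^2 - (\bar X)^2} = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2} \quad\text{and} \tag{4.25}$$

$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X. \tag{4.26}$$

Equations (4.25) and (4.26) are the formulas for $\hat\beta_0$ and $\hat\beta_1$ given in Key Concept 4.2; the formula
$\hat\beta_1 = s_{XY} / s_X^2$ is obtained by dividing the numerator and denominator in Equation (4.25)
by $n - 1$.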
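The derivation above can be checked numerically. The sketch below computes the OLS estimators from both forms of the slope formula and then confirms that the two first-order conditions hold at the solution. The five data points are made up for illustration, and the function names are hypothetical.

```python
def ols_from_deviations(x, y):
    """Slope as the ratio of summed cross-deviations to summed squared deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar  # intercept from the sample means
    return intercept, slope


def ols_from_moments(x, y):
    """Slope from the raw sample moments (the other form of the same formula)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    mean_xy = sum(xi * yi for xi, yi in zip(x, y)) / n
    mean_xx = sum(xi * xi for xi in x) / n
    slope = (mean_xy - x_bar * y_bar) / (mean_xx - x_bar ** 2)
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_from_deviations(x, y)
b0_alt, b1_alt = ols_from_moments(x, y)

# First-order conditions: both sums should be zero (up to rounding) at the OLS solution.
foc_intercept = sum(yi - b0 - b1 * xi for xi, yi in zip(x, y))
foc_slope = sum((yi - b0 - b1 * xi) * xi for xi, yi in zip(x, y))

print(f"deviation form: intercept = {b0:.4f}, slope = {b1:.4f}")
print(f"moment form:    intercept = {b0_alt:.4f}, slope = {b1_alt:.4f}")
print(f"first-order conditions: {foc_intercept:.2e}, {foc_slope:.2e}")
```

The two forms agree because they differ only by the algebraic identity that the sum of cross-deviations equals the sum of cross-products minus n times the product of the means.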


Sampling Distribution of the OLS Estimator 


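In this appendix, we show that the OLS estimator $\hat\beta_1$ is unbiased and, in large samples, has the
normal sampling distribution given in Key Concept 4.4.


Representation of $\hat\beta_1$ in Terms of the Regressors and Errors

We start by providing an expression for $\hat\beta_1$ in terms of the regressors and errors. Because
$Y_i = \beta_0 + \beta_1 X_i + u_i$, $Y_i - \bar Y = \beta_1 (X_i - \bar X) + u_i - \bar u$, so the numerator of the formula for $\hat\beta_1$
in Equation (4.25) is

$$\begin{aligned}
\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) &= \sum_{i=1}^n (X_i - \bar X)[\beta_1 (X_i - \bar X) + (u_i - \bar u)] \\
&= \beta_1 \sum_{i=1}^n (X_i - \bar X)^2 + \sum_{i=1}^n (X_i - \bar X)(u_i - \bar u).
\end{aligned} \tag{4.27}$$

Now $\sum_{i=1}^n (X_i - \bar X)(u_i - \bar u) = \sum_{i=1}^n (X_i - \bar X) u_i - \sum_{i=1}^n (X_i - \bar X) \bar u = \sum_{i=1}^n (X_i - \bar X) u_i$,
where the final equality follows from the definition of $\bar X$, which implies that $\sum_{i=1}^n (X_i - \bar X) \bar u =
(\sum_{i=1}^n X_i - n \bar X) \bar u = 0$. Substituting $\sum_{i=1}^n (X_i - \bar X)(u_i - \bar u) = \sum_{i=1}^n (X_i - \bar X) u_i$ into the final
expression in Equation (4.27) yields $\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) = \beta_1 \sum_{i=1}^n (X_i - \bar X)^2 +
\sum_{i=1}^n (X_i - \bar X) u_i$. Substituting this expression in turn into the formula for $\hat\beta_1$ in Equation (4.25)
yields

$$\hat\beta_1 = \beta_1 + \frac{\frac{1}{n} \sum_{i=1}^n (X_i - \bar X) u_i}{\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2}. \tag{4.28}$$


Proof That $\hat\beta_1$ Is Unbiased

The argument that $\hat\beta_1$ is unbiased under the first least squares assumption uses the law of iterated
expectations [Equation (2.20)]. First, obtain $E(\hat\beta_1 \mid X_1, \dots, X_n)$ by taking the conditional
expectation of both sides of Equation (4.28):

$$\begin{aligned}
E(\hat\beta_1 \mid X_1, \dots, X_n) &= \beta_1 + E\left[ \frac{\frac{1}{n} \sum_{i=1}^n (X_i - \bar X) u_i}{\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2} \,\middle|\, X_1, \dots, X_n \right] \\
&= \beta_1 + \frac{\frac{1}{n} \sum_{i=1}^n (X_i - \bar X) E(u_i \mid X_1, \dots, X_n)}{\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2}.
\end{aligned} \tag{4.29}$$

By the second least squares assumption, $u_i$ is distributed independently of $X$ for all observations
other than $i$, so $E(u_i \mid X_1, \dots, X_n) = E(u_i \mid X_i)$. By the first least squares assumption, however,
$E(u_i \mid X_i) = 0$. Thus the second term in the final line of Equation (4.29) is 0, from which it
follows that $E(\hat\beta_1 \mid X_1, \dots, X_n) = \beta_1$.

Because $\hat\beta_1$ is unbiased given $X_1, \dots, X_n$, it is unbiased after averaging over all samples
$X_1, \dots, X_n$. Thus the unbiasedness of $\hat\beta_1$ follows from Equation (4.29) and the law of iterated
expectations: $E(\hat\beta_1) = E[E(\hat\beta_1 \mid X_1, \dots, X_n)] = \beta_1$.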


Large-Sample Normal Distribution 
of the OLS Estimator 


The large-sample normal approximation to the limiting distribution of $\hat\beta_1$ (Key Concept 4.4)
is obtained by considering the behavior of the final term in Equation (4.28).

First, consider the numerator of this term. Because $\bar X$ is consistent, if the sample size is
large, $\bar X$ is nearly equal to $\mu_X$. Thus, to a close approximation, the term in the numerator of
Equation (4.28) is the sample average $\bar v$, where $v_i = (X_i - \mu_X) u_i$. By the first least squares
assumption, $v_i$ has a mean of 0. By the second least squares assumption, $v_i$ is i.i.d. The variance
of $v_i$ is $\sigma_v^2 = \operatorname{var}[(X_i - \mu_X) u_i]$, which, by the third least squares assumption, is nonzero and
finite. Therefore, $\bar v$ satisfies all the requirements of the central limit theorem (Key Concept 2.7).
Thus $\bar v / \sigma_{\bar v}$ is, in large samples, distributed $N(0, 1)$, where $\sigma_{\bar v}^2 = \sigma_v^2 / n$. Therefore the
distribution of $\bar v$ is well approximated by the $N(0, \sigma_v^2 / n)$ distribution.

Next consider the expression in the denominator in Equation (4.28); this is the sample
variance of $X$ (except dividing by $n$ rather than $n - 1$, which is inconsequential if $n$ is large). As
discussed in Section 3.2 [Equation (3.8)], the sample variance is a consistent estimator of the
population variance, so in large samples it is arbitrarily close to the population variance of $X$.

Combining these two results, we have that, in large samples, $\hat\beta_1 - \beta_1 \cong \bar v / \operatorname{var}(X_i)$,
so that the sampling distribution of $\hat\beta_1$ is, in large samples, $N(\beta_1, \sigma^2_{\hat\beta_1})$, where
$\sigma^2_{\hat\beta_1} = \operatorname{var}(\bar v) / [\operatorname{var}(X_i)]^2 = \operatorname{var}[(X_i - \mu_X) u_i] / \{n [\operatorname{var}(X_i)]^2\}$, which is the expression in
Equation (4.19).
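The large-sample variance expression can be compared with a Monte Carlo estimate. The sketch below is an illustration under assumed conditions, not a result from the text: X is uniform on (0, 10), the error spread grows with X (so the errors are heteroskedastic but still have conditional mean 0), and the population moments in the formula are approximated from one very large simulated sample. All function names and numerical choices are hypothetical.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population-style variance (divide by n), computed with plain floats."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def ols_slope(x, y):
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


def draw_sample(n, rng):
    """Hypothetical design: X ~ Uniform(0, 10); error spread grows with X."""
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    u = [(0.5 + 0.2 * xi) * rng.gauss(0.0, 1.0) for xi in x]  # E(u | X) = 0
    y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]
    return x, u, y


rng = random.Random(123)  # arbitrary seed, for reproducibility only

# Step 1: approximate the population moments in the variance formula
# using one very large simulated sample.
x_big, u_big, _ = draw_sample(200_000, rng)
mu_x = mean(x_big)
var_x = variance(x_big)
var_v = variance([(xi - mu_x) * ui for xi, ui in zip(x_big, u_big)])

n = 200
sd_formula = math.sqrt(var_v / (n * var_x ** 2))

# Step 2: Monte Carlo standard deviation of the slope across repeated samples of size n.
slopes = [ols_slope(x, y) for x, _, y in (draw_sample(n, rng) for _ in range(1000))]
sd_monte_carlo = math.sqrt(variance(slopes))

print(f"std. dev. from the large-sample formula: {sd_formula:.4f}")
print(f"std. dev. across simulated samples:      {sd_monte_carlo:.4f}")
```

The two numbers should be close at this sample size; any remaining gap reflects simulation noise and the fact that the formula is a large-sample approximation.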
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Some Additional Algebraic Facts About OLS 
The OLS residuals and predicted values satisfy 


I 
i ;= 0 4.30 
Sao aso 
Iia = 
17 <9 (4.31) 
ni=1 
5 û;X; = 0 and sy = 0, and (4.32) 
sl 
TSS = SSR + ESS. (4.33) 


Equations (4.30) through (4.33) say that the sample average of the OLS residuals is 0; the sample 
average of the OLS predicted values equals Y; the sample covariance s} y between the OLS residuals 
and the regressors is 0; and the total sum of squares is the sum of squared residuals and the explained 
sum of squares. [The ESS, TSS, and SSR are defined in Equations (4.12), (4.13), and (4.15).] 

To verify Equation (4.30), note that the definition of By lets us write the OLS residuals as 
it; = Y, — Bo — ÊX; = (Y; — Y) - ĝı(X; — X); thus 


n n 


yi; = X (Y - Y) - Prd (% - X). 


i=1 i=1 


But the definitions of Y and X imply that >/_,(Y; — Y) = 0 and X;-1(X; — X) = 0, so 
TaD. 

To verify Equation (4.31), note that Y, = Ê + û,so XY, = XÊ + DL = DÊ, 
where the second equality is a consequence of Equation (4.30). 

To verify Equation (4.32), note that >j_,; = 0 implies X;-10;:X; = X;-1û;/(X; — X), so 


Sax = S10- P - AX -DX - X) 


= XO- DA -X -ÂE - XP = 0, (4.34) 
where the final equality in Equation (4.34) is obtained using the formula for Ĝi in 
Equation (4.25). This result, combined with the preceding results, implies that sz = 0. 


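Equation (4.33) follows from the previous results and some algebra:

$$\begin{aligned}
TSS &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 \\
&= \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + 2\sum_{i=1}^{n} (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) \\
&= SSR + ESS + 2\sum_{i=1}^{n} \hat{u}_i \hat{Y}_i = SSR + ESS,
\end{aligned} \tag{4.35}$$

where the final equality follows from $\sum_{i=1}^{n} \hat{u}_i \hat{Y}_i = \sum_{i=1}^{n} \hat{u}_i (\hat{\beta}_0 + \hat{\beta}_1 X_i) = \hat{\beta}_0 \sum_{i=1}^{n} \hat{u}_i + \hat{\beta}_1 \sum_{i=1}^{n} \hat{u}_i X_i = 0$
by the previous results.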
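The algebraic facts in Equations (4.30) through (4.33) hold exactly in any sample, so they can be checked numerically. The following is a minimal sketch, not part of the original text: it uses simulated data (the coefficients 700 and -2.0, the error standard deviation, and the helper name `ols_fit` are illustrative assumptions) and only the Python standard library.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope) of the OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Simulated data, for illustration only (not the California data set).
rng = random.Random(0)
x = [rng.uniform(14, 26) for _ in range(200)]
y = [700 - 2.0 * xi + rng.gauss(0, 15) for xi in x]

b0, b1 = ols_fit(x, y)
y_hat = [b0 + b1 * xi for xi in x]             # OLS predicted values
u_hat = [yi - fi for yi, fi in zip(y, y_hat)]  # OLS residuals
n = len(x)
y_bar = sum(y) / n

mean_resid = sum(u_hat) / n                            # (4.30): equals 0
mean_fitted = sum(y_hat) / n                           # (4.31): equals y_bar
sum_resid_x = sum(u * xi for u, xi in zip(u_hat, x))   # (4.32): equals 0
tss = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum(u ** 2 for u in u_hat)
ess = sum((fi - y_bar) ** 2 for fi in y_hat)

# Each comparison holds up to floating-point rounding error.
print("(4.30)", abs(mean_resid) < 1e-9)
print("(4.31)", abs(mean_fitted - y_bar) < 1e-9)
print("(4.32)", abs(sum_resid_x) < 1e-6)
print("(4.33)", abs(tss - (ssr + ess)) < 1e-6)
```

Because the identities are consequences of the OLS formulas rather than of the data-generating process, changing the seed or the simulated coefficients should not affect the result.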
The Least Squares Assumptions for Prediction


Section 4.4 provides the least squares assumptions for estimation of a causal effect. There is a 
parallel set of least squares assumptions for prediction. The difference between the two stems 
from the difference between the two problems. For estimation of a causal effect, X must be 
randomly assigned or as-if randomly assigned, which leads to least squares assumption 1 in 
Key Concept 4.3. In contrast, as discussed in Section 4.3, the goal of prediction is to provide 
accurate out-of-sample predictions. To do so, the estimated regression line must be relevant to 
the observation being predicted. This is the case if the data used for estimation and the obser- 
vation being predicted are drawn from the same population distribution. 

For example, return to the superintendent’s and father’s problems. The superintendent is 
interested in the causal effect on TestScore of a change in STR. Ideally, to answer her question 
we would have data from an experiment in which students were randomly assigned to different 
size classes. Absent such an experiment, she may or may not be satisfied with the regression 
of TestScore on STR using California data—that depends on whether least squares assumption 
1 is satisfied where $\beta_1$ is defined to be the causal effect.

In contrast, the father is interested in predicting test scores in a California district that did 
not report its test scores, so for his purposes he is interested in the population regression line 
relating TestScore and STR in California, the slope of which may or may not be the causal effect. 

To make this precise, we introduce some additional notation. Let $(X^{oos}, Y^{oos})$ denote the
out-of-sample (“oos”) observation for which the prediction is to be made, and continue to let
$(X_i, Y_i)$, $i = 1, \ldots, n$, be the data used to estimate the regression coefficients. The least
squares assumptions for prediction are

$E(Y \mid X) = \beta_0 + \beta_1 X$ and $u = Y - E(Y \mid X)$, where

1. $(X^{oos}, Y^{oos})$ are randomly drawn from the same population distribution as
$(X_i, Y_i)$, $i = 1, \ldots, n$;

2. $(X_i, Y_i)$, $i = 1, \ldots, n$, are independent and identically distributed (i.i.d.) draws
from their joint distribution; and

3. Large outliers are unlikely: $X_i$ and $Y_i$ have nonzero finite fourth moments.


There are two differences between these assumptions and the assumptions in Key 
Concept 4.3. The first is the definition of $\beta_1$. The best predictor is given by $E(Y \mid X)$ (where the
best predictor is defined in terms of the mean squared prediction error; see Appendix 2.2).
With the assumption of linearity, for prediction $\beta_1$ is defined to be the slope of this conditional
expectation, which may or may not be the causal effect. Second, because the regression line is 
estimated using in-sample observations but is used to predict an out-of-sample observation, 
the first assumption is that these are drawn from the same population. 

The second and third assumptions are the same as those for estimation of causal effects 
in Section 4.4. They ensure that the OLS estimators are consistent for the coefficients of the
population prediction line and are normally distributed when n is large.
Under the least squares assumptions for prediction, the OLS predicted value of $Y^{oos}$ is
unbiased:

$$\begin{aligned}
E(\hat{Y}^{oos} \mid X^{oos} = x^{oos}) &= E(\hat{\beta}_0 + \hat{\beta}_1 X^{oos} \mid X^{oos} = x^{oos}) \\
&= E(\hat{\beta}_0) + E(\hat{\beta}_1)\, x^{oos},
\end{aligned} \tag{4.36}$$

where the second equality follows because $(X^{oos}, Y^{oos})$ are independent of the observations
used to compute the OLS estimators. For the prediction problem, u is defined to
be $u = Y - E(Y \mid X)$, so by definition $E(u_i \mid X_i) = 0$ and the algebra in Appendix 4.3
applies directly. Thus $E(\hat{\beta}_0) + E(\hat{\beta}_1) x^{oos} = \beta_0 + \beta_1 x^{oos} = E(Y^{oos} \mid X^{oos} = x^{oos})$. Combining
this expression with the first expression in Equation (4.36), we have that
$E(Y^{oos} - \hat{Y}^{oos} \mid X^{oos} = x^{oos}) = 0$; that is, the OLS prediction is unbiased.

The least squares assumptions for prediction also ensure that the regression SER estimates
the standard deviation of the out-of-sample prediction error, $\hat{u}^{oos} = Y^{oos} - \hat{Y}^{oos}$. To show this, it is
useful to write the out-of-sample prediction error as the sum of two terms: the error that would
be made were the regression coefficients known and the error made by needing to estimate
them. Write

$$\hat{u}^{oos} = Y^{oos} - (\hat{\beta}_0 + \hat{\beta}_1 X^{oos}) = \beta_0 + \beta_1 X^{oos} + u^{oos} - (\hat{\beta}_0 + \hat{\beta}_1 X^{oos}) = u^{oos} - [(\hat{\beta}_0 - \beta_0) + (\hat{\beta}_1 - \beta_1) X^{oos}].$$

Thus $\mathrm{var}(\hat{u}^{oos}) = \mathrm{var}(u^{oos}) + \mathrm{var}(\hat{\beta}_0 + \hat{\beta}_1 X^{oos})$
(Exercise 4.15). The second term in this final expression is the contribution of the estimation
error to the out-of-sample prediction error. When the sample size is large, the first term in this
final expression is much larger than the second term. Because the in- and out-of-sample observations
are from the same population, $\mathrm{var}(u^{oos}) = \mathrm{var}(u_i) = \sigma_u^2$, so the standard deviation
of $\hat{u}^{oos}$ is estimated by the SER.
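A small Monte Carlo experiment can illustrate both claims: that the OLS prediction is unbiased and that the SER approximates the standard deviation of the out-of-sample prediction error when n is large. This is a sketch under stated assumptions, not a result from the text: the population values (`beta0`, `beta1`, `sigma_u`), the normal distributions, the sample size, and the number of replications are all chosen purely for illustration.

```python
import math
import random

def ols_fit(x, y):
    """Return (intercept, slope) of the OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

rng = random.Random(42)
beta0, beta1, sigma_u = 5.0, 2.0, 1.0   # assumed population values
n, reps = 100, 2000

errors, sers = [], []
for _ in range(reps):
    # In-sample data: i.i.d. draws from the assumed population.
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta0 + beta1 * xi + rng.gauss(0, sigma_u) for xi in x]
    b0, b1 = ols_fit(x, y)
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sers.append(math.sqrt(ssr / (n - 2)))        # SER for this sample

    # Out-of-sample observation drawn from the same population.
    x_oos = rng.gauss(0, 1)
    y_oos = beta0 + beta1 * x_oos + rng.gauss(0, sigma_u)
    errors.append(y_oos - (b0 + b1 * x_oos))     # prediction error

mean_error = sum(errors) / reps
sd_error = math.sqrt(sum((e - mean_error) ** 2 for e in errors) / (reps - 1))
mean_ser = sum(sers) / reps

print(f"mean prediction error:       {mean_error:.3f}")
print(f"s.d. of prediction error:    {sd_error:.3f}")
print(f"average SER across samples:  {mean_ser:.3f}")
```

Under these assumptions the mean prediction error should be close to 0, and both the standard deviation of the prediction error and the average SER should be close to `sigma_u`, with the prediction-error standard deviation slightly larger because of estimation error.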


Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals


This chapter continues the treatment of linear regression with a single regressor.
Chapter 4 explained how the OLS estimator $\hat{\beta}_1$ of the slope coefficient $\beta_1$ differs
from one sample to the next, that is, how $\hat{\beta}_1$ has a sampling distribution. In this chapter,
we show how knowledge of this sampling distribution can be used to make statements
about $\beta_1$ that accurately summarize the sampling uncertainty. The starting point
is the standard error of the OLS estimator, which measures the spread of the sampling
distribution of $\hat{\beta}_1$. Section 5.1 provides an expression for this standard error (and for
the standard error of the OLS estimator of the intercept) and then shows how to use $\hat{\beta}_1$
and its standard error to test hypotheses. Section 5.2 explains how to construct confidence
intervals for $\beta_1$. Section 5.3 takes up the special case of a binary regressor.

Sections 5.1 through 5.3 assume that the three least squares assumptions of Key 
Concept 4.3 hold. If, in addition, some stronger technical conditions hold, then some 
stronger results can be derived regarding the distribution of the OLS estimator. One of 
these stronger conditions is that the errors are homoskedastic, a concept introduced 
in Section 5.4. Section 5.5 presents the Gauss-Markov theorem, which states that, 
under certain conditions, OLS is efficient (has the smallest variance) among a certain 
class of estimators. Section 5.6 discusses the distribution of the OLS estimator when 
the population distribution of the regression errors is normal. 


Testing Hypotheses About One of the Regression Coefficients


Your client, the superintendent, calls you with a problem. She has an angry taxpayer 
in her office who asserts that cutting class size will not help boost test scores, so hiring 
more teachers is a waste of money. Class size, the taxpayer claims, has no effect on 
test scores. 

The taxpayer’s claim can be restated in the language of regression analysis: The 
taxpayer is asserting that the true causal effect on test scores of a change in class size 
is 0; that is, $\beta_{ClassSize} = 0$.

You already provided the superintendent with an estimate of $\beta_{ClassSize}$ using your
sample of 420 observations on California school districts, under the assumption that
the least squares assumptions of Key Concept 4.3 hold. Is there, the superintendent
asks, evidence in your data that this slope is nonzero? Can you reject the taxpayer’s
hypothesis that $\beta_{ClassSize} = 0$, or should you accept it, at least tentatively pending
further new evidence? 
Key Concept 5.1: General Form of the t-Statistic

In general, the t-statistic has the form

$$t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}. \tag{5.1}$$


This section discusses tests of hypotheses about the population coefficients $\beta_0$
and $\beta_1$. We start by discussing two-sided tests of $\beta_1$ in detail, then turn to one-sided
tests and to tests of hypotheses regarding the intercept $\beta_0$.


Two-Sided Hypotheses Concerning $\beta_1$


The general approach to testing hypotheses about the coefficient $\beta_1$ is the same as to
testing hypotheses about the population mean, so we begin with a brief review. 


Testing hypotheses about the population mean. Recall from Section 3.2 that the
null hypothesis that the mean of Y is a specific value $\mu_{Y,0}$ can be written as
$H_0: E(Y) = \mu_{Y,0}$, and the two-sided alternative is $H_1: E(Y) \neq \mu_{Y,0}$.

The test of the null hypothesis $H_0$ against the two-sided alternative proceeds
as in the three steps summarized in Key Concept 3.6. The first is to compute the
standard error of $\bar{Y}$, $SE(\bar{Y})$, which is an estimator of the standard deviation of the
sampling distribution of $\bar{Y}$. The second step is to compute the t-statistic, which has
the general form given in Key Concept 5.1; applied here, the t-statistic is
$t = (\bar{Y} - \mu_{Y,0})/SE(\bar{Y})$.

The third step is to compute the p-value, which is the smallest significance level at 
which the null hypothesis could be rejected, based on the test statistic actually observed; 
equivalently, the p-value is the probability of obtaining a statistic, by random sampling 
variation, at least as different from the null hypothesis value as is the statistic actually 
observed, assuming that the null hypothesis is correct (Key Concept 3.5). Because 
the t-statistic has a standard normal distribution in large samples under the null 
hypothesis, the p-value for a two-sided hypothesis test is $2\Phi(-|t^{act}|)$, where $t^{act}$ is the
value of the t-statistic actually computed and $\Phi$ is the cumulative standard normal
distribution tabulated in Appendix Table 1. Alternatively, the third step can be
replaced by simply comparing the t-statistic to the critical value appropriate for the
test with the desired significance level. For example, a two-sided test with a 5%
significance level would reject the null hypothesis if $|t^{act}| > 1.96$. In this case, the
population mean is said to be statistically significantly different from the hypothesized
value at the 5% significance level.
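The three steps for testing a hypothesis about the population mean can be sketched in a few lines of code. The function name `mean_t_test` and the simulated sample are illustrative assumptions, not part of the text; the standard normal CDF comes from `statistics.NormalDist` in the Python standard library.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def mean_t_test(sample, mu_0):
    """Large-sample two-sided test of H0: E(Y) = mu_0.

    Returns (standard error, t-statistic, p-value).
    """
    n = len(sample)
    se = stdev(sample) / math.sqrt(n)              # step 1: SE of the sample mean
    t_act = (mean(sample) - mu_0) / se             # step 2: t-statistic
    p_value = 2 * NormalDist().cdf(-abs(t_act))    # step 3: 2 * Phi(-|t_act|)
    return se, t_act, p_value

# Hypothetical sample of 100 observations, for illustration only.
rng = random.Random(1)
sample = [rng.gauss(20.5, 2.0) for _ in range(100)]

se, t_act, p_value = mean_t_test(sample, mu_0=20.0)
print(f"SE = {se:.3f}, t = {t_act:.2f}, p-value = {p_value:.4f}")
print("reject at 5% (critical value 1.96):", abs(t_act) > 1.96)
```

The normal approximation used in step 3 relies on the sample being large, which is why the sketch draws 100 observations rather than a handful.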
Testing hypotheses about the slope $\beta_1$. At a theoretical level, the critical feature
justifying the foregoing testing procedure for the population mean is that, in large
samples, the sampling distribution of $\bar{Y}$ is approximately normal. Because $\hat{\beta}_1$ also has
a normal sampling distribution in large samples, hypotheses about the true value of
the slope $\beta_1$ can be tested using the same general approach.

The null and alternative hypotheses need to be stated precisely before they can
be tested. The angry taxpayer’s hypothesis is that $\beta_{ClassSize} = 0$. More generally, under
the null hypothesis the true population coefficient $\beta_1$ takes on some specific value,
$\beta_{1,0}$. Under the two-sided alternative, $\beta_1$ does not equal $\beta_{1,0}$. That is, the null hypothesis
and the two-sided alternative hypothesis are

$$H_0: \beta_1 = \beta_{1,0} \text{ vs. } H_1: \beta_1 \neq \beta_{1,0} \text{ (two-sided alternative)}. \tag{5.2}$$

To test the null hypothesis $H_0$, we follow the same three steps as for the population
mean.

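The first step is to compute the standard error of $\hat{\beta}_1$, $SE(\hat{\beta}_1)$. The standard error
of $\hat{\beta}_1$ is an estimator of $\sigma_{\hat{\beta}_1}$, the standard deviation of the sampling distribution of $\hat{\beta}_1$.
Specifically,

$$SE(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2_{\hat{\beta}_1}}, \tag{5.3}$$

where

$$\hat{\sigma}^2_{\hat{\beta}_1} = \frac{1}{n} \times \frac{\dfrac{1}{n-2}\sum_{i=1}^{n} (X_i - \bar{X})^2 \hat{u}_i^2}{\left[\dfrac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2\right]^2}. \tag{5.4}$$

The estimator of the variance in Equation (5.4) is discussed in Appendix 5.1. Although
the formula for $\hat{\sigma}^2_{\hat{\beta}_1}$ is complicated, in applications the standard error is computed by
regression software so that it is easy to use in practice.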
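Although regression software normally reports this standard error, Equations (5.3) and (5.4) can be implemented directly. The sketch below is illustrative only: the function name `slope_and_se` and the four-observation data set are assumptions chosen so the arithmetic can be followed by hand, and such a small sample is far too small for the large-sample theory to apply.

```python
import math

def slope_and_se(x, y):
    """OLS slope and its standard error from Equations (5.3) and (5.4)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    u_hat = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]   # OLS residuals

    # Numerator of (5.4): (1/(n-2)) * sum of (X_i - X_bar)^2 * u_hat_i^2
    numerator = sum((xi - x_bar) ** 2 * ui ** 2
                    for xi, ui in zip(x, u_hat)) / (n - 2)
    # Denominator of (5.4): [(1/n) * sum of (X_i - X_bar)^2]^2
    denominator = (sxx / n) ** 2
    var_b1 = (1 / n) * numerator / denominator
    return b1, math.sqrt(var_b1)                          # (5.3)

# Tiny made-up data set, for checking the arithmetic only.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0]
b1, se_b1 = slope_and_se(x, y)
print(f"slope = {b1:.3f}, SE = {se_b1:.4f}")
```

For these four points the slope is 1.1 and the variance estimate in (5.4) works out to 0.1132, which can be confirmed by substituting the residuals -0.1, 0.8, -1.3, and 0.6 into the formula.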

The second step is to compute the t-statistic,

$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}. \tag{5.5}$$

The third step is to compute the p-value, the probability of observing a value of
$\hat{\beta}_1$ at least as different from $\beta_{1,0}$ as the estimate actually computed ($\hat{\beta}_1^{act}$), assuming
that the null hypothesis is correct. Stated mathematically,

$$\begin{aligned}
p\text{-value} &= \Pr_{H_0}\left[\, |\hat{\beta}_1 - \beta_{1,0}| > |\hat{\beta}_1^{act} - \beta_{1,0}| \,\right] \\
&= \Pr_{H_0}\left[\, \left| \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)} \right| > \left| \frac{\hat{\beta}_1^{act} - \beta_{1,0}}{SE(\hat{\beta}_1)} \right| \,\right] = \Pr_{H_0}\left( |t| > |t^{act}| \right),
\end{aligned} \tag{5.6}$$

where $\Pr_{H_0}$ denotes the probability computed under the null hypothesis, the second
equality follows by dividing by $SE(\hat{\beta}_1)$, and $t^{act}$ is the value of the t-statistic actually
computed. Because $\hat{\beta}_1$ is approximately normally distributed in large samples, under
the null hypothesis the t-statistic is approximately distributed as a standard normal
random variable, so in large samples

$$p\text{-value} = \Pr(|Z| > |t^{act}|) = 2\Phi(-|t^{act}|). \tag{5.7}$$

A p-value of less than 5% provides evidence against the null hypothesis in the sense
that, under the null hypothesis, the probability of obtaining a value of $\hat{\beta}_1$ at least as
far from the null as that actually observed is less than 5%. If so, the null hypothesis
is rejected at the 5% significance level.

Alternatively, the hypothesis can be tested at the 5% significance level simply
by comparing the absolute value of the t-statistic to 1.96, the critical value for a two-sided
test, and rejecting the null hypothesis at the 5% level if $|t^{act}| > 1.96$.

These steps are summarized in Key Concept 5.2.

Key Concept 5.2: Testing the Hypothesis $\beta_1 = \beta_{1,0}$ Against the Alternative $\beta_1 \neq \beta_{1,0}$

1. Compute the standard error of $\hat{\beta}_1$, $SE(\hat{\beta}_1)$ [Equation (5.3)].
2. Compute the t-statistic [Equation (5.5)].
3. Compute the p-value [Equation (5.7)]. Reject the hypothesis at the 5% significance
level if the p-value is less than 0.05 or, equivalently, if $|t^{act}| > 1.96$.

The standard error and (typically) the t-statistic and p-value testing $\beta_1 = 0$ are
computed automatically by regression software.


Reporting regression equations and application to test scores. The OLS regression 
of the test score against the student-teacher ratio, reported in Equation (4.9), yielded 
$\hat{\beta}_0 = 698.9$ and $\hat{\beta}_1 = -2.28$. The standard errors of these estimates are $SE(\hat{\beta}_0) = 10.4$
and $SE(\hat{\beta}_1) = 0.52$.

Because of the importance of the standard errors, by convention they are 
included when reporting the estimated OLS coefficients. One compact way to report 
the standard errors is to place them in parentheses below the respective coefficients 
of the OLS regression line: 


$$\widehat{TestScore} = \underset{(10.4)}{698.9} - \underset{(0.52)}{2.28} \times STR, \quad R^2 = 0.051, \; SER = 18.6. \tag{5.8}$$


Equation (5.8) also reports the regression $R^2$ and the standard error of the regression
(SER) following the estimated regression line. Thus Equation (5.8) provides the estimated
regression line, estimates of the sampling uncertainty of the slope and the
intercept (the standard errors), and two measures of the fit of this regression line (the
$R^2$ and the SER). This is a common format for reporting a single regression equation,
and it will be used throughout the rest of this text.

FIGURE 5.1 Calculating the p-Value of a Two-Sided Test When $t^{act} = -4.38$

The p-value of a two-sided test is the probability that $|Z| > |t^{act}|$, where Z is a
standard normal random variable and $t^{act}$ is the value of the t-statistic calculated
from the sample. When $t^{act} = -4.38$, the p-value is only 0.00001. The p-value is the area
to the left of $-4.38$ plus the area to the right of $+4.38$.

Suppose you wish to test the null hypothesis that the slope $\beta_1$ is 0 in the population
counterpart of Equation (5.8) at the 5% significance level. To do so, construct
the t-statistic, and compare its absolute value to 1.96, the 5% (two-sided) critical
value taken from the standard normal distribution. The t-statistic is constructed by
substituting the hypothesized value of $\beta_1$ under the null hypothesis (0), the estimated
slope, and its standard error from Equation (5.8) into the general formula in Equation
(5.5); the result is $t^{act} = (-2.28 - 0)/0.52 = -4.38$. The absolute value of this
t-statistic exceeds the 5% two-sided critical value of 1.96, so the null hypothesis is
rejected in favor of the two-sided alternative at the 5% significance level.

Alternatively, we can compute the p-value associated with $t^{act} = -4.38$. This
probability is the area in the tails of the standard normal distribution, as shown in
Figure 5.1. This probability is extremely small, approximately 0.00001, or 0.001%.
That is, if the null hypothesis $\beta_{ClassSize} = 0$ is true, the probability of obtaining a value
of $\hat{\beta}_1$ as far from the null as the value we actually obtained is extremely small, about
0.001%. Because this event is so unlikely, it is reasonable to conclude that the
null hypothesis is false.
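The calculation for the test score regression can be reproduced with the reported estimate and standard error. This is a minimal sketch using the Python standard library; the variable names are illustrative, and the inputs are the rounded values from Equation (5.8).

```python
from statistics import NormalDist

# Rounded values reported in Equation (5.8).
beta1_hat = -2.28     # estimated slope
se_beta1 = 0.52       # standard error of the slope
beta1_null = 0.0      # hypothesized value under the null

t_act = (beta1_hat - beta1_null) / se_beta1          # Equation (5.5)
p_value = 2 * NormalDist().cdf(-abs(t_act))          # Equation (5.7)
reject_5pct = abs(t_act) > 1.96                      # two-sided 5% test

print(f"t = {t_act:.2f}")
print(f"two-sided p-value = {p_value:.2e}")
print("reject at the 5% level:", reject_5pct)
```

The p-value comes out on the order of 0.00001, in line with the approximate figure quoted in the text.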


One-Sided Hypotheses Concerning $\beta_1$


The discussion so far has focused on testing the hypothesis that $\beta_1 = \beta_{1,0}$ against the
hypothesis that $\beta_1 \neq \beta_{1,0}$. This is a two-sided hypothesis test because, under the
alternative, $\beta_1$ could be either larger or smaller than $\beta_{1,0}$. Sometimes, however, it is
appropriate to use a one-sided hypothesis test. For example, in the student-teacher
ratio/test score problem, many people think that smaller classes provide a better
learning environment. Under that hypothesis, $\beta_1$ is negative: Smaller classes lead to
higher scores. It might make sense therefore to test the null hypothesis that $\beta_1 = 0$
(no effect) against the one-sided alternative that $\beta_1 < 0$.

For a one-sided test, the null hypothesis and the one-sided alternative hypothesis are 


$$H_0: \beta_1 = \beta_{1,0} \text{ vs. } H_1: \beta_1 < \beta_{1,0} \text{ (one-sided alternative)}, \tag{5.9}$$

where $\beta_{1,0}$ is the value of $\beta_1$ under the null (0 in the student-teacher ratio example)
and the alternative is that $\beta_1$ is less than $\beta_{1,0}$. If the alternative is that $\beta_1$ is greater than
$\beta_{1,0}$, the inequality in Equation (5.9) is reversed.

Because the null hypothesis is the same for a one- and a two-sided hypothesis 
test, the construction of the t-statistic is the same. The only difference between a one- 
and a two-sided hypothesis test is how you interpret the t-statistic. For the one-sided 
alternative in Equation (5.9), the null hypothesis is rejected against the one-sided 
alternative for large negative values, but not large positive values, of the t-statistic: 
Instead of rejecting if $|t^{act}| > 1.96$, the hypothesis is rejected at the 5% significance
level if $t^{act} < -1.64$.

The p-value for a one-sided test is obtained from the cumulative standard normal 
distribution as 


$$p\text{-value} = \Pr(Z < t^{act}) = \Phi(t^{act}) \text{ (p-value, one-sided left-tail test)}. \tag{5.10}$$

If the alternative hypothesis is that $\beta_1$ is greater than $\beta_{1,0}$, the inequalities in Equations
(5.9) and (5.10) are reversed, so the p-value is the right-tail probability,
$\Pr(Z > t^{act})$.
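The two-sided, left-tail, and right-tail p-values differ only in which tail area of the standard normal distribution is computed. The helper below is a sketch, not an API from any regression package: the function name `p_value` and the labels `"two-sided"`, `"less"`, and `"greater"` are illustrative assumptions.

```python
from statistics import NormalDist

def p_value(t_act, alternative="two-sided"):
    """Large-sample p-value for a t-statistic under a standard normal null.

    alternative: "two-sided" (beta_1 != beta_1,0),
                 "less"      (beta_1 <  beta_1,0, left tail),
                 "greater"   (beta_1 >  beta_1,0, right tail).
    """
    z = NormalDist()
    if alternative == "two-sided":
        return 2 * z.cdf(-abs(t_act))      # Equation (5.7)
    if alternative == "less":
        return z.cdf(t_act)                # Equation (5.10)
    if alternative == "greater":
        return z.cdf(-t_act)               # Pr(Z > t_act), by symmetry
    raise ValueError("alternative must be 'two-sided', 'less', or 'greater'")

t_act = -4.38   # t-statistic from the test score regression
for alt in ("two-sided", "less", "greater"):
    print(f"{alt:>9}: p-value = {p_value(t_act, alt):.3g}")
```

For a large negative t-statistic such as -4.38, the left-tail p-value is half the two-sided value, while the right-tail p-value is close to 1, so the data give no support to the alternative that the coefficient is positive.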


When should a one-sided test be used? In practice, one-sided alternative hypothe- 
ses should be used only when there is a clear reason for doing so. This reason could 
come from economic theory, prior empirical evidence, or both. However, even if it 
initially seems that the relevant alternative is one-sided, upon reflection this might 
not necessarily be so. A newly formulated drug undergoing clinical trials actually 
could prove harmful because of previously unrecognized side effects. In the class size 
example, we are reminded of the graduation joke that a university’s secret of success 
is to admit talented students and then make sure that the faculty stays out of their 
way and does as little damage as possible. In practice, such ambiguity often leads 
econometricians to use two-sided tests. 


Application to test scores. The t-statistic testing the hypothesis that there is no effect
of class size on test scores [so $\beta_{1,0} = 0$ in Equation (5.9)] is $t^{act} = -4.38$. This value
is less than $-2.33$ (the critical value for a one-sided test with a 1% significance level),
so the null hypothesis is rejected against the one-sided alternative at the 1% level. In
fact, the p-value is less than 0.0006%. Based on these data, you can reject the angry
taxpayer’s assertion that the negative estimate of the slope arose purely because of
random sampling variation at the 1% significance level.


Testing Hypotheses About the Intercept $\beta_0$


This discussion has focused on testing hypotheses about the slope $\beta_1$. Occasionally,
however, the hypothesis concerns the intercept $\beta_0$. The null hypothesis concerning
the intercept and the two-sided alternative are

$$H_0: \beta_0 = \beta_{0,0} \text{ vs. } H_1: \beta_0 \neq \beta_{0,0} \text{ (two-sided alternative)}. \tag{5.11}$$

The general approach to testing this null hypothesis consists of the three steps in
Key Concept 5.2 applied to $\beta_0$ (the formula for the standard error of $\hat{\beta}_0$ is given in
Appendix 5.1). If the alternative is one-sided, this approach is modified as was 
discussed in the previous subsection for hypotheses about the slope. 

Hypothesis tests are useful if you have a specific null hypothesis in mind (as did 
our angry taxpayer). Being able to accept or reject this null hypothesis based on the 
statistical evidence provides a powerful tool for coping with the uncertainty inherent 
in using a sample to learn about the population. Yet there are many times that no 
single hypothesis about a regression coefficient is dominant, and instead one would 
like to know a range of values of the coefficient that are consistent with the data. This 
calls for constructing a confidence interval. 


Confidence Intervals for a Regression Coefficient


Because any statistical estimate of the slope $\beta_1$ necessarily has sampling uncertainty,
we cannot determine the true value of $\beta_1$ exactly from a sample of data. It is possible,
however, to use the OLS estimator and its standard error to construct a confidence
interval for the slope $\beta_1$ or for the intercept $\beta_0$.


Confidence interval for $\beta_1$. Recall from the discussion of confidence intervals in
Section 3.3 that a 95% confidence interval for $\beta_1$ has two equivalent definitions. First,
it is the set of values that cannot be rejected using a two-sided hypothesis test with a
5% significance level. Second, it is an interval that has a 95% probability of containing
the true value of $\beta_1$; that is, in 95% of possible samples that might be drawn, the
confidence interval will contain the true value of $\beta_1$. Because this interval contains
the true value in 95% of all samples, it is said to have a confidence level of 95%.

The reason these two definitions are equivalent is as follows. A hypothesis test
with a 5% significance level will, by definition, reject the true value of $\beta_1$ in only 5%
of all possible samples; that is, in 95% of all possible samples, the true value of $\beta_1$ will
not be rejected. Because the 95% confidence interval (as defined in the first definition)
is the set of all values of $\beta_1$ that are not rejected at the 5% significance level, it
follows that the true value of $\beta_1$ will be contained in the confidence interval in 95%
of all possible samples.

Key Concept 5.3: Confidence Interval for $\beta_1$

A 95% two-sided confidence interval for $\beta_1$ is an interval that contains the true
value of $\beta_1$ with a 95% probability; that is, it contains the true value of $\beta_1$ in 95%
of all possible randomly drawn samples. Equivalently, it is the set of values of $\beta_1$
that cannot be rejected by a 5% two-sided hypothesis test. When the sample size
is large, it is constructed as

$$95\% \text{ confidence interval for } \beta_1 = [\hat{\beta}_1 - 1.96\,SE(\hat{\beta}_1),\; \hat{\beta}_1 + 1.96\,SE(\hat{\beta}_1)]. \tag{5.12}$$

As in the case of a confidence interval for the population mean (Section 3.3), in
principle a 95% confidence interval can be computed by testing all possible values
of $\beta_1$ (that is, testing the null hypothesis $\beta_1 = \beta_{1,0}$ for all values of $\beta_{1,0}$) at the 5%
significance level using the t-statistic. The 95% confidence interval is then the
collection of all the values of $\beta_1$ that are not rejected. But constructing the t-statistic
for all values of $\beta_1$ would take forever.

An easier way to construct the confidence interval is to note that the t-statistic
will reject the hypothesized value $\beta_{1,0}$ whenever $\beta_{1,0}$ is outside the range
$\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1)$. This implies that the 95% confidence interval for $\beta_1$ is the interval
$[\hat{\beta}_1 - 1.96\,SE(\hat{\beta}_1),\; \hat{\beta}_1 + 1.96\,SE(\hat{\beta}_1)]$. This argument parallels the argument used to
develop a confidence interval for the population mean.

The construction of a confidence interval for $\beta_1$ is summarized as Key Concept 5.3.


Confidence interval for $\beta_0$. A 95% confidence interval for $\beta_0$ is constructed as in
Key Concept 5.3, with $\hat{\beta}_0$ and $SE(\hat{\beta}_0)$ replacing $\hat{\beta}_1$ and $SE(\hat{\beta}_1)$.


Application to test scores. The OLS regression of the test score against the student-
teacher ratio, reported in Equation (5.8), yielded $\hat{\beta}_1 = -2.28$ and $SE(\hat{\beta}_1) = 0.52$.
The 95% two-sided confidence interval for $\beta_1$ is $\{-2.28 \pm 1.96 \times 0.52\}$, or
$-3.30 \leq \beta_1 \leq -1.26$. The value $\beta_1 = 0$ is not contained in this confidence interval,
so (as we knew already from Section 5.1) the hypothesis $\beta_1 = 0$ can be rejected at
the 5% significance level.
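The interval in Equation (5.12) is a one-line computation once the estimate and its standard error are available. The sketch below uses the rounded values from Equation (5.8); the function name `confidence_interval_95` is an illustrative assumption.

```python
def confidence_interval_95(estimate, se, critical_value=1.96):
    """Large-sample 95% confidence interval: estimate +/- 1.96 * SE."""
    half_width = critical_value * se
    return estimate - half_width, estimate + half_width

# Rounded slope estimate and standard error from Equation (5.8).
lower, upper = confidence_interval_95(-2.28, 0.52)
contains_zero = lower <= 0.0 <= upper

print(f"95% CI for the slope: [{lower:.2f}, {upper:.2f}]")
print("interval contains 0:", contains_zero)
```

Checking whether 0 lies inside the interval is equivalent to the two-sided test of the null hypothesis that the slope is 0 at the 5% level, which is why the two approaches give the same conclusion here.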


Confidence intervals for predicted effects of changing X. The 95% confidence interval
for $\beta_1$ can be used to construct a 95% confidence interval for the predicted effect
of a general change in X.

Consider changing X by a given amount, $\Delta x$. The expected change in Y associated
with this change in X is $\beta_1 \Delta x$. The population slope $\beta_1$ is unknown, but because we
can construct a confidence interval for $\beta_1$, we can construct a confidence interval for
the expected effect $\beta_1 \Delta x$. Because one end of a 95% confidence interval for $\beta_1$
is $\hat{\beta}_1 - 1.96\,SE(\hat{\beta}_1)$, the predicted effect of the change $\Delta x$ using this estimate of $\beta_1$
is $[\hat{\beta}_1 - 1.96\,SE(\hat{\beta}_1)] \times \Delta x$. The other end of the confidence interval is
$\hat{\beta}_1 + 1.96\,SE(\hat{\beta}_1)$, and the predicted effect of the change using that estimate is
$[\hat{\beta}_1 + 1.96\,SE(\hat{\beta}_1)] \times \Delta x$. Thus a 95% confidence interval for the effect of changing
X by the amount $\Delta x$ can be expressed as

$$95\% \text{ confidence interval for } \beta_1 \Delta x = [(\hat{\beta}_1 - 1.96\,SE(\hat{\beta}_1))\Delta x,\; (\hat{\beta}_1 + 1.96\,SE(\hat{\beta}_1))\Delta x]. \tag{5.13}$$

For example, our hypothetical superintendent is contemplating reducing the student-
teacher ratio by 2. Because the 95% confidence interval for $\beta_1$ is $[-3.30, -1.26]$, the
effect of reducing the student-teacher ratio by 2 could be as great as
$-3.30 \times (-2) = 6.60$ or as little as $-1.26 \times (-2) = 2.52$. Thus decreasing the
student-teacher ratio by 2 is estimated to increase test scores by between 2.52 and
6.60 points, with a 95% confidence level.
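The superintendent's calculation can be sketched as follows. One practical detail is that when the change in X is negative, multiplying reverses the order of the endpoints, so the sketch sorts them before reporting the interval. The function name `effect_interval` is an illustrative assumption.

```python
def effect_interval(ci_beta1, delta_x):
    """95% confidence interval for beta_1 * delta_x, as in Equation (5.13).

    ci_beta1 is the (lower, upper) 95% confidence interval for beta_1.
    The endpoints are sorted because a negative delta_x flips their order.
    """
    lower_beta, upper_beta = ci_beta1
    ends = (lower_beta * delta_x, upper_beta * delta_x)
    return min(ends), max(ends)

# Reducing the student-teacher ratio by 2, using the interval [-3.30, -1.26].
effect_lo, effect_hi = effect_interval((-3.30, -1.26), delta_x=-2)
print(f"predicted change in test scores: between {effect_lo:.2f} and {effect_hi:.2f} points")
```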


Regression When X Is a Binary Variable 


The discussion so far has focused on the case that the regressor is a continuous 
variable. Regression analysis can also be used when the regressor is binary, that
is, when it takes on only two values, 0 and 1. For example, X might be a worker’s
sex (= 1 if female, = 0 if male), whether a school district is urban or rural
(= 1 if urban, = 0 if rural), or whether the district’s class size is small or large
(= 1 if small, = 0 if large). A binary variable is also called an indicator variable or
sometimes a dummy variable. 


Interpretation of the Regression Coefficients 


The mechanics of regression with a binary regressor are the same as if it is continuous.
The interpretation of $\beta_1$, however, is different, and it turns out that regression
with a binary variable is equivalent to performing a difference of means analysis, as 
described in Section 3.4. 

To see this, suppose you have a variable $D_i$ that equals either 0 or 1, depending
on whether the student-teacher ratio is less than 20:

$$D_i = \begin{cases} 1 & \text{if the student-teacher ratio in the } i^{\text{th}} \text{ district} < 20 \\ 0 & \text{if the student-teacher ratio in the } i^{\text{th}} \text{ district} \geq 20 \end{cases} \tag{5.14}$$

The population regression model with $D_i$ as the regressor is

$$Y_i = \beta_0 + \beta_1 D_i + u_i, \quad i = 1, \ldots, n. \tag{5.15}$$


This is the same as the regression model with the continuous regressor $X_i$ except
that now the regressor is the binary variable $D_i$. Because $D_i$ is not continuous, it is not
useful to think of $\beta_1$ as a slope; indeed, because $D_i$ can take on only two values, there
is no “line,” so it makes no sense to talk about a slope. Thus we will not refer to $\beta_1$ as
the slope in Equation (5.15); instead, we will simply refer to $\beta_1$ as the coefficient
multiplying $D_i$ in this regression or, more compactly, the coefficient on $D_i$.

If $\beta_1$ in Equation (5.15) is not a slope, what is it? The best way to interpret $\beta_0$ and
$\beta_1$ in a regression with a binary regressor is to consider, one at a time, the two possible
cases, $D_i = 0$ and $D_i = 1$. If the student-teacher ratio is high, then $D_i = 0$, and Equation
(5.15) becomes


$$Y_i = \beta_0 + u_i \quad (D_i = 0). \tag{5.16}$$

Because $E(u_i \mid D_i) = 0$, the conditional expectation of $Y_i$ when $D_i = 0$ is
$E(Y_i \mid D_i = 0) = \beta_0$; that is, $\beta_0$ is the population mean value of test scores when the
student-teacher ratio is high. Similarly, when $D_i = 1$,

$$Y_i = \beta_0 + \beta_1 + u_i \quad (D_i = 1). \tag{5.17}$$

Thus, when $D_i = 1$, $E(Y_i \mid D_i = 1) = \beta_0 + \beta_1$; that is, $\beta_0 + \beta_1$ is the population
mean value of test scores when the student-teacher ratio is low.

Because $\beta_0 + \beta_1$ is the population mean of $Y_i$ when $D_i = 1$ and $\beta_0$ is the
population mean of $Y_i$ when $D_i = 0$, the difference $(\beta_0 + \beta_1) - \beta_0 = \beta_1$ is the difference
between these two means. In other words, $\beta_1$ is the difference between the
conditional expectation of $Y_i$ when $D_i = 1$ and when $D_i = 0$, or
$\beta_1 = E(Y_i \mid D_i = 1) - E(Y_i \mid D_i = 0)$. In the test score example, $\beta_1$ is the difference between the mean test
score in districts with low student-teacher ratios and the mean test score in districts
with high student-teacher ratios.

Because $\beta_1$ is the difference in the population means, it makes sense that the
OLS estimator $\hat{\beta}_1$ is the difference between the sample averages of $Y_i$ in the two
groups, and, in fact, this is the case.
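The equivalence between OLS with a binary regressor and a difference of group means can be confirmed numerically. In the sketch below the eight test scores are made up for illustration, and `ols_fit` is a hypothetical helper implementing the usual OLS formulas for a single regressor.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data: d = 1 for districts with a low student-teacher ratio.
d = [1, 1, 1, 0, 0, 0, 0, 1]
y = [660.0, 655.5, 662.1, 648.0, 651.2, 645.9, 653.3, 658.4]

b0, b1 = ols_fit(d, y)

y_when_1 = [yi for di, yi in zip(d, y) if di == 1]
y_when_0 = [yi for di, yi in zip(d, y) if di == 0]
mean_1 = sum(y_when_1) / len(y_when_1)   # sample average when D = 1
mean_0 = sum(y_when_0) / len(y_when_0)   # sample average when D = 0

print(f"OLS intercept = {b0:.2f}, sample mean for D = 0: {mean_0:.2f}")
print(f"OLS coefficient on D = {b1:.2f}, difference in means: {mean_1 - mean_0:.2f}")
```

The intercept matches the sample average of the D = 0 group and the coefficient on D matches the difference between the two group averages, up to floating-point rounding.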


Hypothesis tests and confidence intervals. If the two population means are the
same, then $\beta_1$ in Equation (5.15) is 0. Thus the null hypothesis that the two population
means are the same can be tested against the alternative hypothesis that they differ
by testing the null hypothesis $\beta_1 = 0$ against the alternative $\beta_1 \neq 0$. This hypothesis
can be tested using the procedure outlined in Section 5.1. Specifically, the null hypothesis
can be rejected at the 5% level against the two-sided alternative when the OLS
t-statistic $t = \hat{\beta}_1/SE(\hat{\beta}_1)$ exceeds 1.96 in absolute value. Similarly, a 95% confidence
interval for $\beta_1$, constructed as $\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1)$ as described in Section 5.2, provides
a 95% confidence interval for the difference between the two population means.
Application to test scores. As an example, a regression of the test score against the
student-teacher ratio binary variable D defined in Equation (5.14) estimated by OLS
using the 420 observations in Figure 4.2 yields

$$\widehat{TestScore} = \underset{(1.3)}{650.0} + \underset{(1.8)}{7.4}\,D, \quad R^2 = 0.037, \; SER = 18.7, \tag{5.18}$$

where the standard errors of the OLS estimates of the coefficients $\beta_0$ and $\beta_1$ are given
in parentheses below the OLS estimates. Thus the average test score for the subsample
with student-teacher ratios greater than or equal to 20 (that is, for which
D = 0) is 650.0, and the average test score for the subsample with student-teacher
ratios less than 20 (so D = 1) is 650.0 + 7.4 = 657.4. The difference between the
sample average test scores for the two groups is 7.4. This is the OLS estimate of $\beta_1$,
the coefficient on the student-teacher ratio binary variable D.

Is the difference in the population mean test scores in the two groups statistically 
significantly different from 0 at the 5% level? To find out, construct the t-statistic on 
$\hat{\beta}_1$: $t = 7.4/1.8 = 4.04$. This value exceeds 1.96 in absolute value, so the hypothesis
that the population mean test scores in districts with high and low student-teacher
ratios are the same can be rejected at the 5% significance level.

The OLS estimator and its standard error can be used to construct a 95% confidence
interval for the true difference in means. This is $7.4 \pm 1.96 \times 1.8 = (3.9, 10.9)$.
This confidence interval excludes $\beta_1 = 0$, so that (as we know from the previous
paragraph) the hypothesis $\beta_1 = 0$ can be rejected at the 5% significance level.
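
The arithmetic in this application can be checked in a few lines of Python. The sketch below is only an illustration: the variable names are ours, and the inputs are the rounded values reported in Equation (5.18). With those rounded inputs the ratio 7.4/1.8 comes out at about 4.11 rather than the 4.04 quoted above, a gap that presumably reflects rounding in the reported estimate and standard error; the test conclusion is the same either way.

```python
# Reported values from Equation (5.18); both are rounded to one decimal place.
beta1_hat = 7.4   # estimated difference in mean test scores (D = 1 minus D = 0)
se_beta1 = 1.8    # standard error of that estimate

# t-statistic for H0: beta1 = 0 and the two-sided test at the 5% level
t_stat = beta1_hat / se_beta1
reject_at_5pct = abs(t_stat) > 1.96

# 95% confidence interval: estimate +/- 1.96 standard errors
ci_low = beta1_hat - 1.96 * se_beta1
ci_high = beta1_hat + 1.96 * se_beta1

print(f"t = {t_stat:.2f}, reject at 5%: {reject_at_5pct}")
print(f"95% CI: ({ci_low:.1f}, {ci_high:.1f})")
```

Because the interval excludes 0, the confidence interval and the t-test lead to the same decision, as the text notes.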


Heteroskedasticity and Homoskedasticity 


Our only assumption about the distribution of $u_i$ conditional on $X_i$ is that it has a mean
of 0 (the first least squares assumption). If, furthermore, the variance of this conditional
distribution does not depend on $X_i$, then the errors are said to be homoskedastic. This
section discusses homoskedasticity, its theoretical implications, the simplified formulas 
for the standard errors of the OLS estimators that arise if the errors are homoskedastic, 
and the risks you run if you use these simplified formulas in practice. 


What Are Heteroskedasticity and Homoskedasticity? 


Definitions of heteroskedasticity and homoskedasticity. The error term $u_i$ is
homoskedastic if the variance of the conditional distribution of $u_i$ given $X_i$ is constant
for $i = 1, \ldots, n$ and in particular does not depend on $X_i$. Otherwise, the error term
is heteroskedastic.

Homoskedasticity and heteroskedasticity are illustrated in Figure 5.2. The
distribution of the errors $u_i$ is shown for various values of x. Because this distribution
applies specifically for the indicated value of x, this is the conditional distribution of
$u_i$ given $X_i = x$; by the first least squares assumption, this distribution has mean 0 for
all x. In Figure 5.2(a), all these conditional distributions have the same spread; more
precisely, the variance of these distributions is the same for the various values of x.
That is, in Figure 5.2(a), the conditional variance of $u_i$ given $X_i = x$ does not depend
on x, so the errors illustrated in Figure 5.2(a) are homoskedastic.

[Figure 5.2, panel (b): The errors are heteroskedastic. Horizontal axis: student-teacher ratio.]

In contrast, Figure 5.2(b) illustrates a case in which the conditional distribution
of $u_i$ spreads out as x increases. For small values of x, this distribution is tight, but for
larger values of x, it has a greater spread. Thus in Figure 5.2(b) the variance of $u_i$ given
$X_i = x$ increases with x, so that the errors in Figure 5.2(b) are heteroskedastic.

The definitions of heteroskedasticity and homoskedasticity are summarized in 
Key Concept 5.4. 


Key Concept 5.4: Heteroskedasticity and Homoskedasticity

The error term $u_i$ is homoskedastic if the variance of the conditional distribution
of $u_i$ given $X_i$, $var(u_i | X_i = x)$, is constant for $i = 1, \ldots, n$ and in particular does
not depend on x. Otherwise, the error term is heteroskedastic.


Example. These terms are a mouthful, and the definitions might seem abstract. To help
clarify them with an example, we digress from the student-teacher ratio/test score
problem and instead return to the example of variation in household earnings by
socioeconomic class and level of education considered in the box in Chapter 3 titled “Social
Class or Education? Childhood Circumstances and Adult Earnings Revisited.” Let
$HIGHER_i$ be a binary variable that equals 1 for people whose father’s NS-SEC grouping
was higher and equals 0 if this grouping was routine. The binary variable regression
model relating a college graduate’s household earnings to his or her father’s
socioeconomic classification is


$Earnings_i = \beta_0 + \beta_1 HIGHER_i + u_i$ (5.19)


for $i = 1, \ldots, n$. Because the regressor is binary, $\beta_1$ is the difference in the population
means of the two groups; in this case, it is the difference in household mean earnings
between people whose father was in a higher socioeconomic class and people
whose father was in a lower socioeconomic class.

The definition of homoskedasticity states that the variance of u; does not depend 
on the regressor. Here the regressor is $HIGHER_i$, so at issue is whether the variance
of the error term depends on $HIGHER_i$. In other words, is the variance of the error
term the same for people whose father’s socioeconomic classification was higher and 
for those whose father’s socioeconomic classification was lower? If so, the error is 
homoskedastic; if not, it is heteroskedastic. 

Deciding whether the variance of $u_i$ depends on $HIGHER_i$ requires thinking
hard about what the error term actually is. In this regard, it is useful to write Equation
(5.19) as two separate equations, one for each socioeconomic classification:


$Earnings_i = \beta_0 + u_i$ (routine NS-SEC) and (5.20)
$Earnings_i = \beta_0 + \beta_1 + u_i$ (higher NS-SEC). (5.21)


Thus, for those whose father’s socioeconomic classification was lower, $u_i$ is the deviation
of the $i$th such person’s household earnings from the population mean of such earnings
for such people ($\beta_0$), and for those whose father’s socioeconomic classification was
higher, $u_i$ is the deviation of the $i$th such person’s household earnings from the population
mean of such earnings for those whose father’s socioeconomic classification was
higher ($\beta_0 + \beta_1$). It follows that the statement “the variance of $u_i$ does not depend on
$HIGHER_i$” is equivalent to the statement “the variance of earnings is the same across
socioeconomic classifications.” In other words, in this example, the error term is
homoskedastic if the variance of the population distribution of earnings is the same across
NS-SEC classifications; if these variances differ, the error term is heteroskedastic.
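
The equivalence between "the variance of the error depends on the binary regressor" and "the two groups have different earnings variances" can be seen in a small simulation. The sketch below uses made-up population parameters (the means and standard deviations are assumptions chosen only for illustration, not estimates from the earnings data), and the variable names are ours.

```python
import random
import statistics

random.seed(0)

# Hypothetical population parameters, chosen only for illustration.
beta0, beta1 = 20.0, 5.0            # group means are beta0 and beta0 + beta1
sd_routine, sd_higher = 6.0, 12.0   # different spreads => heteroskedastic errors

n = 5000
higher = [random.randint(0, 1) for _ in range(n)]
u = [random.gauss(0.0, sd_higher if h == 1 else sd_routine) for h in higher]
earnings = [beta0 + beta1 * h + e for h, e in zip(higher, u)]

# Within a group the regression function is constant, so the variance of
# earnings in that group equals the variance of the error term u_i.
var_routine = statistics.variance([y for y, h in zip(earnings, higher) if h == 0])
var_higher = statistics.variance([y for y, h in zip(earnings, higher) if h == 1])

print(f"sample variance of earnings, HIGHER = 0: {var_routine:.1f}")
print(f"sample variance of earnings, HIGHER = 1: {var_higher:.1f}")
```

Setting the two standard deviations equal in this sketch would produce the homoskedastic case, in which the two sample variances differ only by sampling variation.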


Mathematical Implications of Homoskedasticity 


The OLS estimators remain unbiased and asymptotically normal. Because the least 
squares assumptions in Key Concept 4.3 place no restrictions on the conditional vari- 
ance, they apply to both the general case of heteroskedasticity and the special case 
of homoskedasticity. Therefore, the OLS estimators remain unbiased and consistent 
even if the errors are homoskedastic. In addition, the OLS estimators have sampling
distributions that are normal in large samples even if the errors are homoskedastic. 
Whether the errors are homoskedastic or heteroskedastic, the OLS estimator is unbi- 
ased, consistent, and asymptotically normal. 


Efficiency of the OLS estimator when the errors are homoskedastic. If the least
squares assumptions in Key Concept 4.3 hold and the errors are homoskedastic, then
the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are efficient among all estimators that are linear in
$Y_1, \ldots, Y_n$ and are unbiased, conditional on $X_1, \ldots, X_n$. This result, which is called
the Gauss-Markov theorem, is discussed in Section 5.5.


Homoskedasticity-only variance formula. If the error term is homoskedastic, then
the formulas for the variances of $\hat{\beta}_0$ and $\hat{\beta}_1$ in Key Concept 4.4 simplify. Consequently,
if the errors are homoskedastic, then there is a specialized formula that can be used
for the standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$. The homoskedasticity-only standard error of $\hat{\beta}_1$,
derived in Appendix 5.1, is $SE(\hat{\beta}_1) = \sqrt{\tilde{\sigma}^2_{\hat{\beta}_1}}$, where $\tilde{\sigma}^2_{\hat{\beta}_1}$ is the homoskedasticity-only
estimator of the variance of $\hat{\beta}_1$:


$\tilde{\sigma}^2_{\hat{\beta}_1} = \dfrac{s^2_{\hat{u}}}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$ (homoskedasticity-only), (5.22)


where $s^2_{\hat{u}}$ is given in Equation (4.17). The homoskedasticity-only formula for the standard
error of $\hat{\beta}_0$ is given in Appendix 5.1. In the special case that X is a binary variable,
the estimator of the variance of $\hat{\beta}_1$ under homoskedasticity (that is, the square
of the standard error of $\hat{\beta}_1$ under homoskedasticity) is the so-called pooled variance
formula for the difference in means given in Equation (3.23).

Because these alternative formulas are derived for the special case that the errors 
are homoskedastic and do not apply if the errors are heteroskedastic, they will be 
referred to as the “homoskedasticity-only” formulas for the variance and standard error 
of the OLS estimators. As the name suggests, if the errors are heteroskedastic, then the 
homoskedasticity-only standard errors are inappropriate. Specifically, if the errors are 
heteroskedastic, then the t-statistic computed using the homoskedasticity-only standard 
error does not have a standard normal distribution, even in large samples. In fact, the 
correct critical values to use for this homoskedasticity-only t-statistic depend on the 
precise nature of the heteroskedasticity, so those critical values cannot be tabulated. 
Similarly, if the errors are heteroskedastic but a confidence interval is constructed as
± 1.96 homoskedasticity-only standard errors, in general the probability that this interval
contains the true value of the coefficient is not 95%, even in large samples.

In contrast, because homoskedasticity is a special case of heteroskedasticity, the estimators
$\hat{\sigma}^2_{\hat{\beta}_1}$ and $\hat{\sigma}^2_{\hat{\beta}_0}$ of the variances of $\hat{\beta}_1$ and $\hat{\beta}_0$ given in Equations (5.4) and (5.26) produce
valid statistical inferences whether the errors are heteroskedastic or homoskedastic. Thus
hypothesis tests and confidence intervals based on those standard errors are valid whether
or not the errors are heteroskedastic. Because the standard errors we have used so far [that
is, those based on Equations (5.4) and (5.26)] lead to statistical inferences that are valid
whether or not the errors are heteroskedastic, they are called heteroskedasticity-robust
standard errors. Because such formulas were proposed by Eicker (1967), Huber (1967),
and White (1980), they are also referred to as Eicker-Huber-White standard errors.
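
To make the contrast concrete, the following sketch computes both kinds of standard error for the slope in a single-regressor model using only the Python standard library. The homoskedasticity-only variance follows Equation (5.22). The robust variance is written in the form we assume Equation (5.4) takes (with the n - 2 degrees-of-freedom adjustment), since that equation is not reproduced in this section. The function name `ols_simple` and the tiny data set are hypothetical.

```python
import math

def ols_simple(x, y):
    """OLS for Y = b0 + b1*X; returns b0, b1 and two standard errors for b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    b1 = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

    # Homoskedasticity-only variance, Equation (5.22): s_u^2 / sum (X_i - X_bar)^2
    s2_u = sum(e * e for e in resid) / (n - 2)
    var_homo = s2_u / sxx

    # Heteroskedasticity-robust variance (assumed form of Equation (5.4)):
    # (1/n) * [sum (X_i - X_bar)^2 u_i^2 / (n - 2)] / [sum (X_i - X_bar)^2 / n]^2
    num = sum((d * e) ** 2 for d, e in zip(dx, resid)) / (n - 2)
    var_robust = num / (n * (sxx / n) ** 2)

    return b0, b1, math.sqrt(var_homo), math.sqrt(var_robust)

# Made-up binary-regressor sample: six tightly clustered observations with
# X = 0 and two widely spread observations with X = 1.
x = [0, 0, 0, 0, 0, 0, 1, 1]
y = [10.0, 11.0, 9.0, 10.0, 11.0, 9.0, 8.0, 18.0]

b0, b1, se_homo, se_robust = ols_simple(x, y)
print(f"b1 = {b1:.2f}, homoskedasticity-only SE = {se_homo:.2f}, robust SE = {se_robust:.2f}")
```

In this made-up sample the smaller group has the larger spread, so the robust standard error comes out larger than the homoskedasticity-only one. With a binary regressor the squared homoskedasticity-only standard error also coincides with the pooled variance formula, which here is the pooled variance of 9 multiplied by (1/6 + 1/2).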


What Does This Mean in Practice? 


Which is more realistic, heteroskedasticity or homoskedasticity? The answer to this 
question depends on the application. However, the issues can be clarified by returning to 
the example of the social class gap in earnings among college graduates. Familiarity with 
how people are paid in the world around us gives some clues as to which assumption is 
more sensible. Those who are born into relatively poorer circumstances are more likely 
to remain in poorer circumstances later in life, and live in households where earnings do 
not fall into the top income bracket. This suggests that the distribution of earnings may 
be tighter for people who grew up in relative deprivation than those who grew up in more 
fortunate circumstances (see the box in Chapter 3 “Social Class or Education? Child- 
hood Circumstances and Adult Earnings Revisited”). In other words, the variance of the 
error term in Equation (5.20) for those whose father’s socioeconomic classification was 
lower is plausibly less than the variance of the error term in Equation (5.21) for those 
whose father’s socioeconomic classification was higher. Thus, the still-thin presence of 
those whose father’s socioeconomic classification was lower in high-income households 
suggests that the error term in the binary variable regression model in Equation (5.19) is 
heteroskedastic. Unless there are compelling reasons to the contrary—and we can think 
of none—it makes sense to treat the error term in this example as heteroskedastic. 

As the example of earnings illustrates, heteroskedasticity arises in many economet- 
ric applications. At a general level, economic theory rarely gives any reason to believe 
that the errors are homoskedastic. It therefore is prudent to assume that the errors 
might be heteroskedastic unless you have compelling reasons to believe otherwise. 


Practical implications. The main issue of practical relevance in this discussion is 
whether one should use heteroskedasticity-robust or homoskedasticity-only standard 
errors. In this regard, it is useful to imagine computing both, then choosing between them. 
If the homoskedasticity-only and heteroskedasticity-robust standard errors are the same, 
nothing is lost by using the heteroskedasticity-robust standard errors; if they differ, how- 
ever, then you should use the more reliable ones that allow for heteroskedasticity. The 
simplest thing, then, is always to use the heteroskedasticity-robust standard errors. 

For historical reasons, many software programs report homoskedasticity- 
only standard errors as their default setting, so it is up to the user to specify the 
option of heteroskedasticity-robust standard errors. The details of how to implement 
heteroskedasticity-robust standard errors depend on the software package you use. 

All of the empirical examples in this book employ heteroskedasticity-robust
standard errors unless explicitly stated otherwise.¹


1 In case this book is used in conjunction with other texts, it might be helpful to note that some textbooks 
add homoskedasticity to the list of least squares assumptions. As just discussed, however, this additional 
assumption is not needed for the validity of OLS regression analysis as long as heteroskedasticity-robust 
standard errors are used. 


The Economic Value of a Year of Education:
Homoskedasticity or Heteroskedasticity?


On average, workers with more education have higher earnings than workers with less
education. But if the best-paying jobs mainly go to the college educated, it might also
be that the spread of the distribution of earnings is greater for workers
with more education. Does the distribution of earnings spread out as education increases?

This is an empirical question, so answering it 
requires analyzing data. Figure 5.3 is a scatterplot of 
the hourly earnings and the number of years of edu- 
cation for a sample of 2731 full-time workers in the 
United States in 2015, ages 29 and 30, with between 
8 and 18 years of education. The data come from the 
March 2016 Current Population Survey, which is 
described in Appendix 3.1. 

Figure 5.3 has two striking features. The first is that 
the mean of the distribution of earnings increases 
with the number of years of education. This increase 
is summarized by the OLS regression line, 


$\widehat{Earnings} = -12.12 + 2.37\,Years\ Education$, $R^2 = \ldots$, $SER = \ldots$ (5.23)
                (1.36)   (0.10)


This line is plotted in Figure 5.3. The coefficient of 2.37 in the OLS regression line
means that, on average, hourly earnings increase by $2.37 for each
additional year of education. The 95% confidence
interval for this coefficient is 2.37 ± 1.96 × 0.10, or
$2.17 to $2.57.

The second striking feature of Figure 5.3 is that 
the spread of the distribution of earnings increases 
with the years of education. While some workers 
with many years of education have low-paying jobs, 
very few workers with low levels of education have 
high-paying jobs. This can be quantified by looking 
at the spread of the residuals around the OLS regres- 
sion line. For workers with ten years of education, 
the standard deviation of the residuals is $6.31; for 
workers with a high school diploma, this standard 
deviation is $8.54; and for workers with a college 
degree, this standard deviation increases to $13.55. 
Because these standard deviations differ for differ- 
ent levels of education, the variance of the residuals 
in the regression of Equation (5.23) depends on the 
value of the regressor (the years of education); in 
other words, the regression errors are heteroskedas- 
tic. In real-world terms, not all college graduates will 
be earning $75 per hour by the time they are 29, but 
some will, and workers with only ten years of education have no shot at those jobs.

[Figure 5.3: Scatterplot of Hourly Earnings and Years of Education]
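
The comparison of residual spreads described in the box can be mimicked on simulated data. The sketch below does not use the CPS sample: it generates artificial earnings whose error standard deviation is assumed to rise linearly with years of education, fits the OLS line, and then reports the standard deviation of the residuals at each education level. All parameter values and names are assumptions made for illustration.

```python
import random
import statistics

random.seed(1)

# Simulated data, NOT the CPS sample: the error standard deviation is assumed
# to grow with years of education (0.8 * years - 2.0).
years = [random.choice([10, 12, 16]) for _ in range(6000)]
earnings = [-12.0 + 2.4 * x + random.gauss(0.0, 0.8 * x - 2.0) for x in years]

# OLS fit of earnings on years of education
n = len(years)
x_bar = sum(years) / n
y_bar = sum(earnings) / n
sxx = sum((x - x_bar) ** 2 for x in years)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, earnings)) / sxx
b0 = y_bar - b1 * x_bar

# Standard deviation of the OLS residuals at each education level
resid_sd = {}
for level in sorted(set(years)):
    resid = [y - b0 - b1 * x for x, y in zip(years, earnings) if x == level]
    resid_sd[level] = statistics.stdev(resid)
    print(f"{level} years of education: residual SD = {resid_sd[level]:.2f}")
```

A residual spread that changes systematically with the regressor, as it does here by construction, is the pattern the box identifies as heteroskedasticity.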
The Theoretical Foundations
of Ordinary Least Squares 


As discussed in Section 4.5, the OLS estimator is unbiased, is consistent, has a vari- 
ance that is inversely proportional to n, and has a normal sampling distribution when 
the sample size is large. In addition, under certain conditions the OLS estimator is 
more efficient than some other candidate estimators. Specifically, if the least squares 
assumptions hold and if the errors are homoskedastic, then the OLS estimator has 
the smallest variance of all conditionally unbiased estimators that are linear functions
of $Y_1, \ldots, Y_n$. This section explains and discusses this result, which is a consequence
of the Gauss-Markov theorem. The section concludes with a discussion of
alternative estimators that are more efficient than OLS when the conditions of the
Gauss-Markov theorem do not hold.


Linear Conditionally Unbiased Estimators and 
the Gauss-Markov Theorem 


If the three least squares assumptions in Key Concept 4.3 hold and if the error is
homoskedastic, then the OLS estimator has the smallest variance, conditional on
$X_1, \ldots, X_n$, among all estimators in the class of linear conditionally unbiased estimators.
In other words, the OLS estimator is the Best Linear conditionally Unbiased
Estimator; that is, it is BLUE. This result is an extension of the result, summarized
in Key Concept 3.3, that the sample average $\bar{Y}$ is the most efficient estimator of the
population mean in the class of all estimators that are unbiased and are linear functions
(weighted averages) of $Y_1, \ldots, Y_n$.


Linear conditionally unbiased estimators. The class of linear conditionally unbiased
estimators consists of all estimators of $\beta_1$ that are linear functions of $Y_1, \ldots, Y_n$ and
that are unbiased, conditional on $X_1, \ldots, X_n$. That is, if $\tilde{\beta}_1$ is a linear estimator, then
it can be written as

$\tilde{\beta}_1 = \sum_{i=1}^{n} a_i Y_i$ ($\tilde{\beta}_1$ is linear), (5.24)

where the weights $a_1, \ldots, a_n$ can depend on $X_1, \ldots, X_n$ but not on $Y_1, \ldots, Y_n$. The
estimator $\tilde{\beta}_1$ is conditionally unbiased if the mean of its conditional sampling distribution
given $X_1, \ldots, X_n$ is $\beta_1$. That is, the estimator $\tilde{\beta}_1$ is conditionally unbiased if

$E(\tilde{\beta}_1 | X_1, \ldots, X_n) = \beta_1$ ($\tilde{\beta}_1$ is conditionally unbiased). (5.25)


The estimator $\tilde{\beta}_1$ is a linear conditionally unbiased estimator if it can be written in
the form of Equation (5.24) (it is linear) and if Equation (5.25) holds (it is conditionally
unbiased). It is shown in Appendix 5.2 that the OLS estimator is linear and
conditionally unbiased.


* This section is optional and is not used in later chapters. 
Key Concept 5.5: The Gauss-Markov Theorem for $\hat{\beta}_1$

If the three least squares assumptions in Key Concept 4.3 hold and if errors are
homoskedastic, then the OLS estimator $\hat{\beta}_1$ is the Best (most efficient) Linear
conditionally Unbiased Estimator (BLUE).


The Gauss-Markov theorem. The Gauss-Markov theorem states that, under a set
of conditions known as the Gauss-Markov conditions, the OLS estimator $\hat{\beta}_1$ has the
smallest conditional variance given $X_1, \ldots, X_n$ of all linear conditionally unbiased
estimators of $\beta_1$; that is, the OLS estimator is BLUE. The Gauss-Markov conditions,
which are stated in Appendix 5.2, are implied by the three least squares assumptions
plus the assumption that the errors are homoskedastic. Consequently, if the three
least squares assumptions hold and the errors are homoskedastic, then OLS is BLUE.
The Gauss-Markov theorem is stated in Key Concept 5.5 and proven in Appendix 5.2.
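
A small numerical comparison may help fix ideas. The sketch below writes the OLS slope estimator as a weighted sum of the $Y_i$, with weights $a_i = (X_i - \bar{X}) / \sum_j (X_j - \bar{X})^2$, and compares it with one hypothetical competitor, the slope of the line through the first and last observations. It relies on two facts that follow from the definitions above when the observations are independent and the errors homoskedastic: a linear estimator is conditionally unbiased for $\beta_1$ when $\sum a_i = 0$ and $\sum a_i X_i = 1$, and its conditional variance is $\sigma_u^2 \sum a_i^2$. The regressor values and the competing estimator are assumptions made for illustration.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# OLS weights: beta1_hat = sum(a_i * Y_i), a_i = (X_i - X_bar) / sum_j (X_j - X_bar)^2
a_ols = [(xi - x_bar) / sxx for xi in x]

# A competing linear estimator (hypothetical): the slope through the first
# and last observations, (Y_n - Y_1) / (X_n - X_1).
a_alt = [0.0] * n
a_alt[0] = -1.0 / (x[-1] - x[0])
a_alt[-1] = 1.0 / (x[-1] - x[0])

def is_conditionally_unbiased(a, x, tol=1e-12):
    # E(sum a_i Y_i | X) = beta0 * sum(a_i) + beta1 * sum(a_i X_i), so the
    # estimator is unbiased for beta1 iff sum(a_i) = 0 and sum(a_i X_i) = 1.
    return abs(sum(a)) < tol and abs(sum(ai * xi for ai, xi in zip(a, x)) - 1.0) < tol

# Under homoskedasticity, var(sum a_i Y_i | X) = sigma_u^2 * sum(a_i^2),
# so comparing sum(a_i^2) compares the conditional variances.
var_factor_ols = sum(ai ** 2 for ai in a_ols)
var_factor_alt = sum(ai ** 2 for ai in a_alt)

print(is_conditionally_unbiased(a_ols, x), is_conditionally_unbiased(a_alt, x))
print(f"sum of squared weights: OLS = {var_factor_ols:.4f}, endpoints = {var_factor_alt:.4f}")
```

Both estimators are linear and conditionally unbiased, but the OLS weights have the smaller sum of squares, which is what the theorem guarantees for any competitor in this class.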


Limitations of the Gauss-Markov theorem. The Gauss-Markov theorem provides
a theoretical justification for using OLS. However, the theorem has two important
limitations. First, its conditions might not hold in practice. In particular, if the error
term is heteroskedastic, as it often is in economic applications, then the OLS estimator
is no longer BLUE. As discussed in Section 5.4, the presence of heteroskedasticity
does not pose a threat to inference based on heteroskedasticity-robust standard
errors, but it does mean that OLS is no longer the efficient linear conditionally unbiased
estimator. An alternative to OLS when there is heteroskedasticity of a known
form, called the weighted least squares estimator, is discussed below.

The second limitation of the Gauss-Markov theorem is that even if the conditions
of the theorem hold, there are other candidate estimators that are not linear
and conditionally unbiased; under some conditions, these other estimators are more
efficient than OLS.


Regression Estimators Other Than OLS 


Under certain conditions, some regression estimators are more efficient than OLS. 


The weighted least squares estimator. If the errors are heteroskedastic, then OLS
is no longer BLUE. If the nature of the heteroskedasticity is known (specifically, if
the conditional variance of $u_i$ given $X_i$ is known up to a constant factor of
proportionality), then it is possible to construct an estimator that has a smaller variance than
the OLS estimator. This method, called weighted least squares (WLS), weights the $i$th
observation by the inverse of the square root of the conditional variance of $u_i$ given
$X_i$. Because of this weighting, the errors in this weighted regression are homoskedastic,
so OLS, when applied to the weighted data, is BLUE. Although theoretically
elegant, the practical problem with weighted least squares is that you must know how
the conditional variance of $u_i$ depends on $X_i$, something that is rarely known in
econometric applications. Weighted least squares is therefore used far less frequently
than OLS, and further discussion of WLS is deferred to Chapter 18.
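
The weighting step can be sketched as follows. The function below, whose name and interface are our own, divides every term of the regression (the constant, the regressor, and the dependent variable) by the square root of a supplied conditional variance and then applies OLS to the weighted data. The example assumes, purely for illustration, that the conditional variance of $u_i$ is proportional to $X_i^2$; in practice that functional form is exactly what is rarely known.

```python
import math
import random

def wls_simple(x, y, cond_var):
    """WLS for Y = b0 + b1*X; cond_var[i] is proportional to var(u_i | X_i)."""
    w = [1.0 / math.sqrt(v) for v in cond_var]  # inverse square root of the variance
    # Weighted data: every term of the regression, including the constant, is scaled.
    c = w                                    # transformed constant term
    xs = [wi * xi for wi, xi in zip(w, x)]   # transformed regressor
    ys = [wi * yi for wi, yi in zip(w, y)]   # transformed dependent variable

    # OLS of ys on (c, xs) with no additional intercept: 2x2 normal equations.
    scc = sum(ci * ci for ci in c)
    scx = sum(ci * xi for ci, xi in zip(c, xs))
    sxx = sum(xi * xi for xi in xs)
    scy = sum(ci * yi for ci, yi in zip(c, ys))
    sxy = sum(xi * yi for xi, yi in zip(xs, ys))
    det = scc * sxx - scx * scx
    b0 = (sxx * scy - scx * sxy) / det
    b1 = (scc * sxy - scx * scy) / det
    return b0, b1

# Simulated example: the error standard deviation equals X_i, so var(u_i | X_i) = X_i^2.
random.seed(2)
x = [random.uniform(1.0, 10.0) for _ in range(2000)]
y = [3.0 + 2.0 * xi + random.gauss(0.0, xi) for xi in x]

b0_wls, b1_wls = wls_simple(x, y, [xi ** 2 for xi in x])
print(f"WLS estimates: b0 = {b0_wls:.2f}, b1 = {b1_wls:.2f}")
```

Only the relative sizes of the supplied variances matter: multiplying `cond_var` by any positive constant leaves the estimates unchanged, which is why the variance needs to be known only up to a factor of proportionality.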


The least absolute deviations estimator. As discussed in Section 4.3, the OLS estimator
can be sensitive to outliers. If extreme outliers are not rare, then other estimators
can be more efficient than OLS and can produce inferences that are more
reliable. One such estimator is the least absolute deviations (LAD) estimator, in
which the regression coefficients $\beta_0$ and $\beta_1$ are obtained by solving a minimization
problem like that in Equation (4.4) except that the absolute value of the prediction
“mistake” is used instead of its square. That is, the LAD estimators of $\beta_0$ and $\beta_1$ are
the values of $b_0$ and $b_1$ that minimize $\sum_{i=1}^{n} |Y_i - b_0 - b_1 X_i|$. The LAD estimator is
less sensitive to large outliers in u than is OLS.

In many economic data sets, severe outliers in u are rare, so use of the LAD 
estimator, or other estimators with reduced sensitivity to outliers, is uncommon in 
applications. Thus the treatment of linear regression throughout the remainder of 
this text focuses exclusively on least squares methods. 
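
For readers curious about how LAD behaves, here is a minimal sketch for small samples. It is not an efficient algorithm: it relies on the fact that, with a single regressor and an intercept, some minimizer of the sum of absolute deviations passes through at least two data points, and it simply searches over all such lines. The function names and the seven-point data set (six points on a line plus one outlier) are hypothetical.

```python
from itertools import combinations

def sum_abs_dev(b0, b1, x, y):
    """Sum of absolute prediction mistakes for the line b0 + b1*X."""
    return sum(abs(yi - b0 - b1 * xi) for xi, yi in zip(x, y))

def lad_simple(x, y):
    """LAD fit of Y = b0 + b1*X by exhaustive search over lines through two points.

    Intended for small samples only: the search cost grows with the cube of n.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a regression line
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        loss = sum_abs_dev(b0, b1, x, y)
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]

x = [1, 2, 3, 4, 5, 6, 7]
y = [3, 5, 7, 9, 11, 13, 50]   # points on Y = 1 + 2X, except the last (an outlier)

lad_b0, lad_b1 = lad_simple(x, y)

# OLS on the same data, for comparison
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
ols_b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
ols_b0 = y_bar - ols_b1 * x_bar

print(f"LAD: b0 = {lad_b0:.2f}, b1 = {lad_b1:.2f}")
print(f"OLS: b0 = {ols_b0:.2f}, b1 = {ols_b1:.2f}")
```

In this constructed example the LAD line follows the six points that lie on a common line, while the OLS slope is pulled sharply toward the single outlier.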


Using the t-Statistic in Regression 
When the Sample Size Is Small 


When the sample size is small, the exact distribution of the t-statistic is complicated
and depends on the unknown population distribution of the data. If, however, the
three least squares assumptions hold, the regression errors are homoskedastic, and
the regression errors are normally distributed, then the OLS estimator is normally
distributed and the homoskedasticity-only t-statistic has a Student t distribution.
These five assumptions (the three least squares assumptions, that the errors are
homoskedastic, and that the errors are normally distributed) are collectively called
the homoskedastic normal regression assumptions.


The t-Statistic and the Student t Distribution 


Recall from Section 2.4 that the Student t distribution with m degrees of freedom is
defined to be the distribution of $Z / \sqrt{W/m}$, where Z is a random variable with a
standard normal distribution, W is a random variable with a chi-squared distribution
with m degrees of freedom, and Z and W are independent. Under the null hypothesis,
the t-statistic computed using the homoskedasticity-only standard error can be written
in this form.

The details of the calculation are presented in Sections 18.4 and 19.4; the main
ideas are as follows. The homoskedasticity-only t-statistic testing $\beta_1 = \beta_{1,0}$ is
$t = (\hat{\beta}_1 - \beta_{1,0}) / \tilde{\sigma}_{\hat{\beta}_1}$, where $\tilde{\sigma}^2_{\hat{\beta}_1}$ is defined in Equation (5.22). Under the homoskedastic
normal regression assumptions, $Y_i$ has a normal distribution, conditional on
$X_1, \ldots, X_n$. As discussed in Section 5.5, the OLS estimator is a weighted average of
$Y_1, \ldots, Y_n$, where the weights depend on $X_1, \ldots, X_n$ [see Equation (5.32) in Appendix
5.2]. Because a weighted average of independent normal random variables is
normally distributed, $\hat{\beta}_1$ has a normal distribution, conditional on $X_1, \ldots, X_n$. Thus
$\hat{\beta}_1 - \beta_{1,0}$ has a normal distribution with mean 0 under the null hypothesis, conditional
on $X_1, \ldots, X_n$. In addition, Sections 18.4 and 19.4 show that the (normalized)
homoskedasticity-only variance estimator has a chi-squared distribution with $n - 2$
degrees of freedom, divided by $n - 2$, and $\tilde{\sigma}^2_{\hat{\beta}_1}$ and $\hat{\beta}_1$ are independently distributed.
Consequently, the homoskedasticity-only t-statistic has a Student t distribution with
$n - 2$ degrees of freedom.


* This section is optional and is not used in later chapters.

This result is closely related to a result discussed in Section 3.5 in the context of
testing for the equality of the means in two samples. In that problem, if the two population
distributions are normal with the same variance and if the t-statistic is constructed
using the pooled standard error formula [Equation (3.23)], then the (pooled)
t-statistic has a Student t distribution. When X is binary, the homoskedasticity-only
standard error for $\hat{\beta}_1$ simplifies to the pooled standard error formula for the difference
of means. It follows that the result of Section 3.5 is a special case of the result that if
the homoskedastic normal regression assumptions hold, then the homoskedasticity-only
regression t-statistic has a Student t distribution (see Exercise 5.10).


Use of the Student t Distribution in Practice 


If the regression errors are homoskedastic and normally distributed and if the
homoskedasticity-only t-statistic is used, then critical values should be taken from the
Student t distribution (Appendix Table 2) instead of the standard normal distribution.
Because the difference between the Student t distribution and the normal
distribution is negligible if n is moderate or large, this distinction is relevant only if
the sample size is small.
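
How much the distinction matters in a very small sample can be gauged by simulation. The sketch below draws many samples of size n = 5 that satisfy the homoskedastic normal regression assumptions with $\beta_1 = 0$, computes the homoskedasticity-only t-statistic for each, and records how often the normal critical value 1.96 rejects the true null hypothesis. The sample size, regressor values, number of replications, and function name are all choices made for illustration.

```python
import math
import random

random.seed(3)

def homoskedastic_t(x, y):
    """Homoskedasticity-only t-statistic for H0: beta1 = 0 in Y = b0 + b1*X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    # Standard error from Equation (5.22): sqrt( [SSR / (n - 2)] / sum (X_i - X_bar)^2 )
    return b1 / math.sqrt(ssr / (n - 2) / sxx)

n, reps = 5, 20000
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # regressor values held fixed across replications
rejections = 0
for _ in range(reps):
    # The null hypothesis is true: Y does not depend on X, and errors are N(0, 1).
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    if abs(homoskedastic_t(x, y)) > 1.96:
        rejections += 1

rejection_rate = rejections / reps
print(f"rejection rate using the 1.96 critical value: {rejection_rate:.3f}")
```

With n - 2 = 3 degrees of freedom, the 5% two-sided critical value of the Student t distribution is about 3.18, so using 1.96 rejects a true null hypothesis far more often than 5% of the time in this design. Raising `n` in the sketch shows the rejection rate moving toward 5%.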

In econometric applications, there is rarely a reason to believe that the errors are 
homoskedastic and normally distributed. Because sample sizes typically are large, 
however, inference can proceed as described in Sections 5.1 and 5.2—that is, by first 
computing heteroskedasticity-robust standard errors and then by using the standard 
normal distribution to compute p-values, hypothesis tests, and confidence intervals. 


Conclusion 


Return for a moment to the problem of the superintendent who is considering hiring 
additional teachers to cut the student-teacher ratio. What have we learned that she 
might find useful? 

Our regression analysis, based on the 420 observations in the California test
score data set, showed that there was a negative relationship between the student-teacher
ratio and test scores: Districts with smaller classes have higher test scores.
The coefficient is moderately large, in a practical sense: Districts with two fewer
students per teacher have, on average, test scores that are 4.6 points higher. This
corresponds to moving a district at the 50th percentile of the distribution of test scores
to approximately the 60th percentile.

The coefficient on the student-teacher ratio is statistically significantly different
from 0 at the 5% significance level. The population coefficient might be 0, and we
might simply have estimated our negative coefficient by random sampling variation.
However, the probability of doing so (and of obtaining a t-statistic on $\beta_1$ as large as
we did) purely by random variation over potential samples is exceedingly small,
approximately 0.001%. A 95% confidence interval for $\beta_1$ is $-3.30 \le \beta_1 \le -1.26$.
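
The numbers quoted in the last two paragraphs fit together, as the following sketch shows. It starts from the reported confidence interval only, backs out the implied estimate and standard error (assuming the interval was built as the estimate ± 1.96 standard errors), and then recomputes the t-statistic, the large-sample two-sided p-value, and the predicted effect of two fewer students per teacher. The variable names are ours, and small discrepancies from the published figures would reflect rounding of the interval endpoints.

```python
import math

ci_low, ci_high = -3.30, -1.26   # reported 95% confidence interval for beta1

# The interval is estimate +/- 1.96 * SE, so its endpoints imply:
beta1_hat = (ci_low + ci_high) / 2
se_beta1 = (ci_high - ci_low) / (2 * 1.96)

t_stat = beta1_hat / se_beta1

# Two-sided p-value from the standard normal distribution:
# 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

# Predicted change in test scores from two fewer students per teacher
effect_two_fewer = -2 * beta1_hat

print(f"estimate = {beta1_hat:.2f}, SE = {se_beta1:.2f}, t = {t_stat:.2f}")
print(f"p-value = {100 * p_value:.4f}%")
print(f"effect of two fewer students per teacher = {effect_two_fewer:.1f} points")
```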

These results represent progress toward answering the superintendent’s 
question—yet a nagging concern remains. There is a negative relationship between 
the student-teacher ratio and test scores, but is this relationship the causal one that 
the superintendent needs to make her decision? Districts with lower student-teacher 
ratios have, on average, higher test scores. But does this mean that reducing the 
student-teacher ratio will, in fact, increase scores? 

The question of whether OLS applied to the California data estimates the causal 
effect of class size on test scores can be sharpened by returning to the least squares 
assumptions of Key Concept 4.3. The first least squares assumption requires that, 
when $\beta_1$ is defined to be the causal effect, the distribution of the errors has conditional
mean 0. This requirement has the interpretation of, in effect, requiring X (class
size) to be randomly assigned or as-if randomly assigned. Because the California data 
are observational, class size was not randomly assigned. So the question is: In the 
California data, is class size as-if randomly assigned, in the sense that E(u|X) = 0? 

There is, in fact, reason to worry that it might not be. Hiring more teachers, after 
all, costs money, so wealthier school districts can better afford smaller classes. But 
students at wealthier schools also have other advantages over their poorer neigh- 
bors, including better facilities, newer books, and better-paid teachers. Moreover, 
students at wealthier schools tend themselves to come from more affluent families 
and thus have other advantages not directly associated with their school. For exam- 
ple, California has a large immigrant community; these immigrants tend to be poorer 
than the overall population, and in many cases, their children are not native English 
speakers. It thus might be that our negative estimated relationship between test 
scores and the student-teacher ratio is a consequence of large classes being found 
in conjunction with many other factors that are, in fact, the real reason for the lower 
test scores. 

These other factors, or “omitted variables,” could mean that the OLS analysis 
done so far has little value to the superintendent. Indeed, it could be misleading: 
Changing the student-teacher ratio alone would not change these other factors that 
determine a child’s performance at school. To address this problem, we need a 
method that will allow us to isolate the effect on test scores of changing the student- 
teacher ratio, holding these other factors constant. That method is multiple regression 
analysis, the topic of Chapters 6 and 7.




Summary 


1. 


Hypothesis testing for regression coefficients is analogous to hypothesis testing 
for the population mean: Use the t-statistic to calculate the p-values and either 
accept or reject the null hypothesis. Like a confidence interval for the popula- 
tion mean, a 95% confidence interval for a regression coefficient is computed 
as the estimator + 1.96 standard errors. 


2. When X is binary, the regression model can be used to estimate and test 
hypotheses about the difference between the population means of the “X = 0” 
group and the “X = 1” group. 

3. In general, the error u; is heteroskedastic; that is, the variance of u; at a given 
value of X; var(u; |X; = x), depends on x. A special case is when the error is 
homoskedastic; that is, when var(u; | X; = x) is constant. Homoskedasticity- 
only standard errors do not produce valid statistical inferences when the errors 
are heteroskedastic, but heteroskedasticity-robust standard errors do. 

4. If the three least squares assumption hold and if the regression errors are 
homoskedastic, then, as a result of the Gauss—Markov theorem, the OLS esti- 
mator is BLUE. 

5. If the three least squares assumptions hold, if the regression errors are homoskedastic,
and if the regression errors are normally distributed, then the OLS
t-statistic computed using homoskedasticity-only standard errors has a Student
t distribution when the null hypothesis is true. The difference between the
Student t distribution and the normal distribution is negligible if the sample
size is moderate or large.
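
The arithmetic in item 1 can be sketched in a few lines of Python. This is a minimal illustration under the large-sample normal approximation, not a procedure taken from the text; the function name `slope_inference` and the numbers in the usage lines are hypothetical.

```python
import math

def slope_inference(estimate, std_error, null_value=0.0, z_crit=1.96):
    """t-statistic, two-sided p-value, and confidence interval for a coefficient.

    Relies on the large-sample standard normal approximation;
    z_crit=1.96 corresponds to a 95% confidence interval.
    """
    t_stat = (estimate - null_value) / std_error
    # Two-sided p-value 2*Phi(-|t|), written with the complementary error function.
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    ci = (estimate - z_crit * std_error, estimate + z_crit * std_error)
    return t_stat, p_value, ci

# Hypothetical estimate and standard error, chosen only for illustration.
t_stat, p_value, ci = slope_inference(estimate=2.0, std_error=0.5)
print(t_stat, p_value, ci)
```

Changing `z_crit` to 1.64 or 2.58 gives 90% or 99% intervals under the same approximation.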

Key Terms 

null hypothesis
two-sided alternative hypothesis
standard error of $\hat{\beta}_1$
t-statistic
p-value
confidence interval for $\beta_1$
confidence level
indicator variable
dummy variable
coefficient multiplying $D_i$
coefficient on $D_i$
homoskedasticity and heteroskedasticity
homoskedasticity-only standard error
heteroskedasticity-robust standard error
Gauss-Markov theorem
best linear unbiased estimator (BLUE)
weighted least squares (WLS)
homoskedastic normal regression assumptions
Gauss-Markov conditions


Review the Concepts 


5.1 Outline the procedures for computing the p-value of a two-sided test of
$H_0: \mu_Y = 0$ using an i.i.d. set of observations $Y_i$, $i = 1, \ldots, n$. Outline the
procedures for computing the p-value of a two-sided test of $H_0: \beta_1 = 0$ in a
regression model using an i.i.d. set of observations $(Y_i, X_i)$, $i = 1, \ldots, n$.

5.2 When are one-sided hypothesis tests constructed for estimated regression
coefficients as opposed to two-sided hypothesis tests? When are confidence
intervals constructed instead of hypothesis tests?

5.3 Describe the important characteristics of the variance of the conditional
distribution of the error term in a linear regression. What are the implications
for OLS estimation?

5.4 What is a dummy variable or an indicator variable? Describe the differences
in interpretation of the coefficients of a linear regression when the independent
variable is continuous and when it is binary. Give an example of each
case. Explain how the construction of confidence intervals and hypothesis
tests is different when the independent variable is binary compared to when
it is continuous.


Exercises 


5.1 Suppose a researcher, using data on class size (CS) and average test scores
from 50 third-grade classes, estimates the OLS regression

$$\widehat{TestScore} = \underset{(23.5)}{640.3} - \underset{(2.02)}{4.93} \times CS, \quad R^2 = 0.11, \quad SER = 8.7.$$

a. Construct a 95% confidence interval for $\beta_1$, the regression slope
coefficient.

b. Calculate the p-value for the two-sided test of the null hypothesis $H_0: \beta_1 = 0$.
Do you reject the null hypothesis at the 5% level? At the 1% level?

c. Calculate the p-value for the two-sided test of the null hypothesis
$H_0: \beta_1 = -5.0$. Without doing any additional calculations, determine
whether −5.0 is contained in the 95% confidence interval for $\beta_1$.

d. Construct a 90% confidence interval for $\beta_0$.


5.2 Suppose that a researcher, using wage data on 200 randomly selected male
workers and 240 female workers, estimates the OLS regression

$$\widehat{Wage} = \underset{(0.16)}{10.73} + \underset{(0.29)}{1.78} \times Male, \quad R^2 = 0.09, \quad SER = 3.8,$$

where Wage is measured in dollars per hour and Male is a binary variable that
is equal to 1 if the person is a male and 0 if the person is a female. Define the
wage gender gap as the difference in mean earnings between men and women.

a. What is the estimated gender gap?

b. Is the estimated gender gap significantly different from 0? (Compute the
p-value for testing the null hypothesis that there is no gender gap.)

c. Construct a 95% confidence interval for the gender gap.

d. In the sample, what is the mean wage of women? Of men?

e. Another researcher uses these same data but regresses Wages on Female,
a variable that is equal to 1 if the person is female and 0 if the person is
male. What are the regression estimates calculated from this regression?

$\widehat{Wage} = \_\_\_\_ + \_\_\_\_ \times Female$, $R^2 = \_\_\_\_$, $SER = \_\_\_\_$.


5.3 Suppose a random sample of 100 25-year-old men is selected from a population
and their heights and weights are recorded. A regression of weight on
height yields

$$\widehat{Weight} = \underset{(3.42)}{-79.24} + \underset{(0.42)}{4.16} \times Height, \quad R^2 = 0.72, \quad SER = 12.6,$$

where Weight is measured in pounds and Height is measured in inches. One
man has a late growth spurt and grows 2 inches over the course of a year.
Construct a 95% confidence interval for the person’s weight gain.


5.4 Read the box “The Economic Value of a Year of Education: Homoskedasticity
or Heteroskedasticity?” in Section 5.4. Use the regression reported in Equation
(5.23) to answer the following.

a. A randomly selected 30-year-old worker reports an education level of 16
years. What are the worker’s expected average hourly earnings?

b. A high school graduate (12 years of education) is contemplating going
to a community college for a 2-year degree. How much are this worker’s
average hourly earnings expected to increase?

c. A high school counselor tells a student that, on average, college graduates
earn $10 per hour more than high school graduates. Is this statement
consistent with the regression evidence? What range of values is consistent
with the regression evidence?


5.5 In the 1980s, Tennessee conducted an experiment in which kindergarten
students were randomly assigned to “regular” and “small” classes and given
standardized tests at the end of the year. (Regular classes contained approximately
24 students, and small classes contained approximately 15 students.)
Suppose, in the population, the standardized tests have a mean score of 925
points and a standard deviation of 75 points. Let SmallClass denote a binary
variable equal to 1 if the student is assigned to a small class and equal to 0
otherwise. A regression of TestScore on SmallClass yields

$$\widehat{TestScore} = \underset{(1.6)}{918.0} + \underset{(2.5)}{13.9} \times SmallClass, \quad R^2 = 0.01, \quad SER = 74.6.$$

a. Do small classes improve test scores? By how much? Is the effect large?
Explain.

b. Is the estimated effect of class size on test scores statistically significant?
Carry out a test at the 5% level.

c. Construct a 99% confidence interval for the effect of SmallClass on
TestScore.

d. Does least squares assumption 1 plausibly hold for this regression? Explain.

5.6 Refer to the regression described in Exercise 5.5.

a. Do you think that the regression errors are plausibly homoskedastic? Explain.

b. $SE(\hat{\beta}_1)$ was computed using Equation (5.3). Suppose the regression
errors were homoskedastic. Would this affect the validity of the confidence
interval constructed in Exercise 5.5(c)? Explain.


5.7 Suppose $(Y_i, X_i)$ satisfy the least squares assumptions in Key Concept 4.3.
A random sample of size $n = 250$ is drawn and yields

$$\hat{Y} = \underset{(3.1)}{5.4} + \underset{(1.5)}{3.2}X, \quad R^2 = 0.26, \quad SER = 6.2.$$

a. Test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ at the 5% level.

b. Construct a 95% confidence interval for $\beta_1$.

c. Suppose you learned that $Y_i$ and $X_i$ were independent. Would you be
surprised? Explain.

d. Suppose $Y_i$ and $X_i$ are independent and many samples of size $n = 250$ are
drawn, regressions estimated, and (a) and (b) answered. In what fraction
of the samples would $H_0$ from (a) be rejected? In what fraction of samples
would the value $\beta_1 = 0$ be included in the confidence interval from (b)?


5.8 Suppose $(Y_i, X_i)$ satisfy the least squares assumptions in Key Concept 4.3 and, in
addition, $u_i$ is $N(0, \sigma_u^2)$ and is independent of $X_i$. A sample of size $n = 30$ yields

$$\hat{Y} = \underset{(10.2)}{43.2} + \underset{(7.4)}{61.5}X, \quad R^2 = 0.54, \quad SER = 1.52,$$

where the numbers in parentheses are the homoskedasticity-only standard errors
for the regression coefficients.

a. Construct a 95% confidence interval for $\beta_0$.

b. Test $H_0: \beta_1 = 55$ vs. $H_1: \beta_1 \neq 55$ at the 5% level.

c. Test $H_0: \beta_1 = 55$ vs. $H_1: \beta_1 > 55$ at the 5% level.


5.9 Consider the regression model

$$Y_i = \beta X_i + u_i,$$

where $u_i$ and $X_i$ satisfy the least squares assumptions in Key Concept 4.3. Let
$\bar{\beta}$ denote an estimator of $\beta$ that is constructed as $\bar{\beta} = \bar{Y}/\bar{X}$, where $\bar{Y}$ and $\bar{X}$
are the sample means of $Y_i$ and $X_i$, respectively.

a. Show that $\bar{\beta}$ is a linear function of $Y_1, Y_2, \ldots, Y_n$.

b. Show that $\bar{\beta}$ is conditionally unbiased.


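5.10 Let $X_i$ denote a binary variable, and consider the regression $Y_i = \beta_0 +
\beta_1 X_i + u_i$. Let $\bar{Y}_0$ denote the sample mean for observations with $X = 0$,
and let $\bar{Y}_1$ denote the sample mean for observations with $X = 1$. Show that
$\hat{\beta}_0 = \bar{Y}_0$, $\hat{\beta}_0 + \hat{\beta}_1 = \bar{Y}_1$, and $\hat{\beta}_1 = \bar{Y}_1 - \bar{Y}_0$.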

5.11 A random sample of workers contains $n_m = 100$ men and $n_w = 150$ women.
The sample average of men’s weekly earnings $[\bar{Y}_m = (1/n_m)\sum_{i=1}^{n_m} Y_{m,i}]$
is €565.89, and the standard deviation $[s_m = \sqrt{\frac{1}{n_m - 1}\sum_{i=1}^{n_m}(Y_{m,i} - \bar{Y}_m)^2}\,]$
is €75.62. The corresponding values for women are $\bar{Y}_w$ = €502.37 and
$s_w$ = €53.40. Let Women denote an indicator variable that is equal to 1 for
women and 0 for men, and suppose that all 250 observations are used in
the regression $Y_i = \beta_0 + \beta_1 Women_i + u_i$. Find the OLS estimates of $\beta_0$ and
$\beta_1$ and their corresponding standard errors.


5.12 Starting from Equation (4.20), derive the variance of $\hat{\beta}_0$ under homoskedasticity
given in Equation (5.28) in Appendix 5.1.


5.13 Suppose $(Y_i, X_i)$ satisfy the least squares assumptions in Key Concept 4.3 and,
in addition, $u_i$ is distributed $N(0, \sigma_u^2)$ and is independent of $X_i$.

a. Is $\hat{\beta}_1$ conditionally unbiased?

b. Is $\hat{\beta}_1$ the best linear conditionally unbiased estimator of $\beta_1$?

c. How would your answers to (a) and (b) change if you assumed only that
$(Y_i, X_i)$ satisfied the least squares assumptions in Key Concept 4.3 and
$\mathrm{var}(u_i \mid X_i = x)$ is constant?

d. How would your answers to (a) and (b) change if you assumed only that
$(Y_i, X_i)$ satisfied the least squares assumptions in Key Concept 4.3?


5.14 Suppose $Y_i = \beta X_i + u_i$, where $(u_i, X_i)$ satisfy the Gauss-Markov conditions
given in Equation (5.31).

a. Derive the least squares estimator of $\beta$, and show that it is a linear function
of $Y_1, \ldots, Y_n$.

b. Show that the estimator is conditionally unbiased.

c. Derive the conditional variance of the estimator.

d. Prove that the estimator is BLUE.


5.15 A researcher has two independent samples of observations on $(Y_i, X_i)$.
To be specific, suppose $Y_i$ denotes earnings, $X_i$ denotes years of schooling,
and the independent samples are for men and women. Write the regression
for men as $Y_{m,i} = \beta_{m,0} + \beta_{m,1}X_{m,i} + u_{m,i}$ and the regression for women as
$Y_{w,i} = \beta_{w,0} + \beta_{w,1}X_{w,i} + u_{w,i}$. Let $\hat{\beta}_{m,1}$ denote the OLS estimator constructed
using the sample of men, $\hat{\beta}_{w,1}$ denote the OLS estimator constructed from
the sample of women, and $SE(\hat{\beta}_{m,1})$ and $SE(\hat{\beta}_{w,1})$ denote the corresponding
standard errors. Show that the standard error of $\hat{\beta}_{m,1} - \hat{\beta}_{w,1}$ is given by

$$SE(\hat{\beta}_{m,1} - \hat{\beta}_{w,1}) = \sqrt{[SE(\hat{\beta}_{m,1})]^2 + [SE(\hat{\beta}_{w,1})]^2}.$$


Empirical Exercises 


(Only three empirical exercises for this chapter are given in the text, but you can find
more on the text website, http://www.pearsonglobaleditions.com.)


E5.1 Use the data set Earnings_and_Height described in Empirical Exercise 4.2 to
carry out the following exercises.

a. Run a regression of Earnings on Height.

i. Is the estimated slope statistically significant?

ii. Construct a 95% confidence interval for the slope coefficient.

b. Repeat (a) for women.

c. Repeat (a) for men.

d. Test the null hypothesis that the effect of height on earnings is the same
for men and women. (Hint: See Exercise 5.15.)

e. One explanation for the effect of height on earnings is that some
professions require strength, which is correlated with height. Does the
effect of height on earnings disappear when the sample is restricted to
occupations in which strength is unlikely to be important?


E5.2 Using the data set Growth described in Empirical Exercise 4.1, but excluding
the data for Malta, run a regression of Growth on TradeShare.

a. Is the estimated regression slope statistically significant? That is, can you
reject the null hypothesis $H_0: \beta_1 = 0$ vs. a two-sided alternative hypothesis
at the 10%, 5%, or 1% significance level?

b. What is the p-value associated with the coefficient’s t-statistic?

c. Construct a 90% confidence interval for $\beta_1$.

E5.3 On the text website, http://www.pearsonglobaleditions.com, you will find the data
file Birthweight_Smoking, which contains data for a random sample of babies
born in Pennsylvania in 1989. The data include the baby’s birth weight together
with various characteristics of the mother, including whether she smoked during
the pregnancy.² A detailed description is given in Birthweight_Smoking_Description,
also available on the website. In this exercise, you will investigate the
relationship between birth weight and smoking during pregnancy.

a. In the sample:

i. What is the average value of Birthweight for all mothers?

ii. For mothers who smoke?

iii. For mothers who do not smoke?

b. i. Use the data in the sample to estimate the difference in average birth
weight for smoking and nonsmoking mothers.

ii. What is the standard error for the estimated difference in (i)?

iii. Construct a 95% confidence interval for the difference in the average
birth weight for smoking and nonsmoking mothers.

c. Run a regression of Birthweight on the binary variable Smoker.

i. Explain how the estimated slope and intercept are related to your
answers in parts (a) and (b).

ii. Explain how the $SE(\hat{\beta}_1)$ is related to your answer in b(ii).

iii. Construct a 95% confidence interval for the effect of smoking on
birth weight.

d. Do you think smoking is uncorrelated with other factors that cause
low birth weight? That is, do you think that the regression error term
(say, $u_i$) has a conditional mean of 0 given Smoking ($X_i$)? (You will
investigate this further in Birthweight and Smoking exercises in later
chapters.)

² These data were provided by Professors Douglas Almond (Columbia University), Ken Chay (Brown
University), and David Lee (Princeton University) and were used in their paper “The Costs of Low Birth
Weight,” Quarterly Journal of Economics, August 2005, 120(3): 1031-1083.


APPENDIX 5.1

Formulas for OLS Standard Errors


This appendix discusses the formulas for OLS standard errors. These are first presented under
the least squares assumptions in Key Concept 4.3, which allow for heteroskedasticity; these are
the “heteroskedasticity-robust” standard errors. Formulas for the variance of the OLS estimators
and the associated standard errors are then given for the special case of homoskedasticity.


Heteroskedasticity-Robust Standard Errors


The estimator $\hat{\sigma}^2_{\hat{\beta}_1}$ defined in Equation (5.4) is obtained by replacing the population variances
in Equation (4.19) by the corresponding sample variances, with a modification. The variance
in the numerator of Equation (4.19) is estimated by $\frac{1}{n-2}\sum_{i=1}^{n}(X_i - \bar{X})^2\hat{u}_i^2$, where the divisor
$n - 2$ (instead of $n$) incorporates a degrees-of-freedom adjustment to correct for downward
bias, analogously to the degrees-of-freedom adjustment used in the definition of the SER in
Section 4.3. The variance in the denominator is estimated by $(1/n)\sum_{i=1}^{n}(X_i - \bar{X})^2$. Replacing
$\mathrm{var}[(X_i - \mu_X)u_i]$ and $\mathrm{var}(X_i)$ in Equation (4.19) by these two estimators yields $\hat{\sigma}^2_{\hat{\beta}_1}$ in
Equation (5.4). The consistency of heteroskedasticity-robust standard errors is discussed in
Section 18.3.

The estimator of the variance of $\hat{\beta}_0$ is

$$\hat{\sigma}^2_{\hat{\beta}_0} = \frac{1}{n} \times \frac{\frac{1}{n-2}\sum_{i=1}^{n}\hat{H}_i^2\hat{u}_i^2}{\left(\frac{1}{n}\sum_{i=1}^{n}\hat{H}_i^2\right)^2}, \qquad (5.26)$$

where $\hat{H}_i = 1 - \left(\bar{X} \big/ \frac{1}{n}\sum_{i=1}^{n}X_i^2\right)X_i$. The standard error of $\hat{\beta}_0$ is $SE(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2_{\hat{\beta}_0}}$. The reasoning
behind the estimator $\hat{\sigma}^2_{\hat{\beta}_0}$ is the same as behind $\hat{\sigma}^2_{\hat{\beta}_1}$ and stems from replacing population
expectations with sample averages.
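
The two robust variance estimators described above can be computed directly from the data. The following is a minimal sketch, assuming the verbal description of Equation (5.4) given in this appendix and the form of Equation (5.26) with $\hat{H}_i = 1 - (\bar{X} / \frac{1}{n}\sum X_i^2)X_i$; the helper names `ols_fit` and `robust_variances` and the four-point data set are hypothetical.

```python
import math

def ols_fit(x, y):
    """OLS intercept, slope, and residuals for a single regressor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    return b0, b1, resid

def robust_variances(x, resid):
    """Heteroskedasticity-robust variance estimators for intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    # Slope: (1/n) * [1/(n-2) * sum (X_i - X_bar)^2 u_i^2] / [(1/n) * sum (X_i - X_bar)^2]^2
    num1 = sum((xi - x_bar) ** 2 * ui ** 2 for xi, ui in zip(x, resid)) / (n - 2)
    den1 = (sum((xi - x_bar) ** 2 for xi in x) / n) ** 2
    var_b1 = num1 / den1 / n
    # Intercept: same form with H_i in place of (X_i - X_bar).
    mean_x2 = sum(xi ** 2 for xi in x) / n
    h = [1.0 - (x_bar / mean_x2) * xi for xi in x]
    num0 = sum(hi ** 2 * ui ** 2 for hi, ui in zip(h, resid)) / (n - 2)
    den0 = (sum(hi ** 2 for hi in h) / n) ** 2
    var_b0 = num0 / den0 / n
    return var_b0, var_b1

# Hypothetical data, small enough to verify by hand.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 6.0]
b0, b1, resid = ols_fit(x, y)
var_b0, var_b1 = robust_variances(x, resid)
se_b0, se_b1 = math.sqrt(var_b0), math.sqrt(var_b1)
print(b0, b1, se_b0, se_b1)
```

With only four observations the standard errors are of no practical use; the tiny sample serves only to make each sum easy to check against the formulas.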


Homoskedasticity-Only Variances 


Under homoskedasticity, the conditional variance of $u_i$ given $X_i$ is a constant: $\mathrm{var}(u_i \mid X_i) = \sigma_u^2$.
If the errors are homoskedastic, the formulas in Key Concept 4.4 simplify to

$$\sigma^2_{\hat{\beta}_1} = \frac{\sigma_u^2}{n\sigma_X^2} \quad \text{and} \qquad (5.27)$$

$$\sigma^2_{\hat{\beta}_0} = \frac{E(X_i^2)}{n\sigma_X^2}\,\sigma_u^2. \qquad (5.28)$$

To derive Equation (5.27), write the numerator in Equation (4.19) as $\mathrm{var}[(X_i - \mu_X)u_i] =
E(\{(X_i - \mu_X)u_i - E[(X_i - \mu_X)u_i]\}^2) = E\{[(X_i - \mu_X)u_i]^2\} = E[(X_i - \mu_X)^2u_i^2] =
E[(X_i - \mu_X)^2\,\mathrm{var}(u_i \mid X_i)]$, where the second equality follows because $E[(X_i - \mu_X)u_i] = 0$ (by the first
least squares assumption) and where the final equality follows from the law of iterated
expectations (Section 2.3). If $u_i$ is homoskedastic, then $\mathrm{var}(u_i \mid X_i) = \sigma_u^2$, so
$E[(X_i - \mu_X)^2\,\mathrm{var}(u_i \mid X_i)] = \sigma_u^2E[(X_i - \mu_X)^2] = \sigma_u^2\sigma_X^2$. The result in Equation (5.27)
follows by substituting this expression into the numerator of Equation (4.19) and simplifying.
A similar calculation yields Equation (5.28).


Homoskedasticity-Only Standard Errors 


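The homoskedasticity-only standard errors are obtained by substituting sample means and
variances for the population means and variances in Equations (5.27) and (5.28) and by
estimating the variance of $u_i$ by the square of the SER. The homoskedasticity-only estimators
of these variances are

$$\tilde{\sigma}^2_{\hat{\beta}_1} = \frac{s_{\hat{u}}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \quad \text{(homoskedasticity-only) and} \qquad (5.29)$$

$$\tilde{\sigma}^2_{\hat{\beta}_0} = \frac{\left(\frac{1}{n}\sum_{i=1}^{n}X_i^2\right)s_{\hat{u}}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \quad \text{(homoskedasticity-only),} \qquad (5.30)$$

where $s_{\hat{u}}^2$ is given in Equation (4.17). The homoskedasticity-only standard errors are the square
roots of $\tilde{\sigma}^2_{\hat{\beta}_0}$ and $\tilde{\sigma}^2_{\hat{\beta}_1}$.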
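
As a rough illustration, the homoskedasticity-only estimators can be coded as the squared SER divided by $\sum(X_i - \bar{X})^2$ for the slope, and that ratio scaled by $\frac{1}{n}\sum X_i^2$ for the intercept. The function name `homoskedasticity_only_variances` and the data are hypothetical; the SER is assumed to use the $n - 2$ divisor.

```python
import math

def homoskedasticity_only_variances(x, y):
    """Homoskedasticity-only variance estimators for the OLS intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2_u = ssr / (n - 2)                   # square of the SER
    var_b1 = s2_u / sxx                    # slope estimator
    mean_x2 = sum(xi ** 2 for xi in x) / n
    var_b0 = mean_x2 * s2_u / sxx          # intercept estimator
    return var_b0, var_b1

# Hypothetical data, small enough to verify by hand.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 6.0]
var_b0, var_b1 = homoskedasticity_only_variances(x, y)
se_b0, se_b1 = math.sqrt(var_b0), math.sqrt(var_b1)
print(se_b0, se_b1)
```

These estimators are valid only if the errors are homoskedastic; on the same data the heteroskedasticity-robust formulas will generally give different numbers.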


APPENDIX 5.2

The Gauss-Markov Conditions and a Proof
of the Gauss-Markov Theorem


As discussed in Section 5.5, the Gauss-Markov theorem states that if the Gauss-Markov
conditions hold, then the OLS estimator is the best (most efficient) conditionally linear unbiased
estimator (is BLUE). This appendix begins by stating the Gauss-Markov conditions and showing
that they are implied by the three least squares assumptions plus homoskedasticity. We
next show that the OLS estimator is a linear conditionally unbiased estimator. Finally, we turn
to the proof of the theorem.

The three Gauss-Markov conditions are

(i) $E(u_i \mid X_1, \ldots, X_n) = 0$,

(ii) $\mathrm{var}(u_i \mid X_1, \ldots, X_n) = \sigma_u^2$, $0 < \sigma_u^2 < \infty$, and

(iii) $E(u_iu_j \mid X_1, \ldots, X_n) = 0$, $i \neq j$, (5.31)

where the conditions hold for $i, j = 1, \ldots, n$. The three conditions, respectively, state that $u_i$ has
a conditional mean of 0, that $u_i$ has a constant variance, and that the errors are uncorrelated for
different observations, where all these statements hold conditionally on all observed X’s
$(X_1, \ldots, X_n)$.

The Gauss-Markov conditions are implied by the three least squares assumptions
(Key Concept 4.3), plus the additional assumption that the errors are homoskedastic.
Because the observations are i.i.d. (assumption 2), $E(u_i \mid X_1, \ldots, X_n) = E(u_i \mid X_i)$, and by
assumption 1, $E(u_i \mid X_i) = 0$; thus condition (i) holds. Similarly, by assumption 2,
$\mathrm{var}(u_i \mid X_1, \ldots, X_n) = \mathrm{var}(u_i \mid X_i)$, and because the errors are assumed to be homoskedastic,
$\mathrm{var}(u_i \mid X_i) = \sigma_u^2$, which is constant. Assumption 3 (nonzero finite fourth moments) ensures
that $0 < \sigma_u^2 < \infty$, so condition (ii) holds. To show that condition (iii) is implied by the least
squares assumptions, note that $E(u_iu_j \mid X_1, \ldots, X_n) = E(u_iu_j \mid X_i, X_j)$ because $(X_i, Y_i)$ are i.i.d.
by assumption 2. Assumption 2 also implies that $E(u_iu_j \mid X_i, X_j) = E(u_i \mid X_i)E(u_j \mid X_j)$ for
$i \neq j$; because $E(u_i \mid X_i) = 0$ for all $i$, it follows that $E(u_iu_j \mid X_1, \ldots, X_n) = 0$ for all $i \neq j$, so
condition (iii) holds. Thus the least squares assumptions in Key Concept 4.3, plus homoskedasticity
of the errors, imply the Gauss-Markov conditions in Equation (5.31).


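The OLS Estimator $\hat{\beta}_1$ Is a Linear Conditionally
Unbiased Estimator

To show that $\hat{\beta}_1$ is linear, first note that because $\sum_{i=1}^{n}(X_i - \bar{X}) = 0$ (by the definition of $\bar{X}$),
$\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n}(X_i - \bar{X})Y_i - \bar{Y}\sum_{i=1}^{n}(X_i - \bar{X}) = \sum_{i=1}^{n}(X_i - \bar{X})Y_i$.
Substituting this result into the formula for $\hat{\beta}_1$ in Equation (4.5) yields

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})Y_i}{\sum_{j=1}^{n}(X_j - \bar{X})^2} = \sum_{i=1}^{n}\hat{a}_iY_i, \quad \text{where } \hat{a}_i = \frac{X_i - \bar{X}}{\sum_{j=1}^{n}(X_j - \bar{X})^2}. \qquad (5.32)$$

Because the weights $\hat{a}_i$, $i = 1, \ldots, n$, in Equation (5.32) depend on $X_1, \ldots, X_n$ but not on
$Y_1, \ldots, Y_n$, the OLS estimator $\hat{\beta}_1$ is a linear estimator.

Under the Gauss-Markov conditions, $\hat{\beta}_1$ is conditionally unbiased, and the variance of the
conditional distribution of $\hat{\beta}_1$ given $X_1, \ldots, X_n$ is

$$\mathrm{var}(\hat{\beta}_1 \mid X_1, \ldots, X_n) = \frac{\sigma_u^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}. \qquad (5.33)$$

The result that $\hat{\beta}_1$ is conditionally unbiased was previously shown in Appendix 4.3.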
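
The linearity of the OLS slope can be checked numerically. The sketch below builds the weights $\hat{a}_i = (X_i - \bar{X}) / \sum_j (X_j - \bar{X})^2$ from the regressor alone and then forms the slope as a weighted sum of the $Y_i$; the function name `ols_weights` and the data are hypothetical.

```python
def ols_weights(x):
    """Weights a_hat_i = (X_i - X_bar) / sum_j (X_j - X_bar)^2; they depend on X only."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xj - x_bar) ** 2 for xj in x)
    return [(xi - x_bar) / sxx for xi in x]

# Hypothetical data, small enough to verify by hand.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 6.0]

a_hat = ols_weights(x)
slope = sum(ai * yi for ai, yi in zip(a_hat, y))          # OLS slope as a linear function of Y
sum_weights = sum(a_hat)                                   # equals 0
sum_weights_x = sum(ai * xi for ai, xi in zip(a_hat, x))   # equals 1
print(a_hat, slope, sum_weights, sum_weights_x)
```

The two sums at the end are the properties of the weights that make the estimator conditionally unbiased for the slope.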
Proof of the Gauss-Markov Theorem


We start by deriving some facts that hold for all linear conditionally unbiased estimators, that
is, for all estimators $\tilde{\beta}_1$ satisfying Equations (5.24) and (5.25). Substituting $Y_i = \beta_0 + \beta_1X_i + u_i$
into $\tilde{\beta}_1 = \sum_{i=1}^{n}a_iY_i$ and collecting terms, we have that

$$\tilde{\beta}_1 = \beta_0\left(\sum_{i=1}^{n}a_i\right) + \beta_1\left(\sum_{i=1}^{n}a_iX_i\right) + \sum_{i=1}^{n}a_iu_i. \qquad (5.34)$$

By the first Gauss-Markov condition, $E(\sum_{i=1}^{n}a_iu_i \mid X_1, \ldots, X_n) = \sum_{i=1}^{n}a_iE(u_i \mid X_1, \ldots, X_n) = 0$;
thus taking conditional expectations of both sides of Equation (5.34) yields $E(\tilde{\beta}_1 \mid X_1, \ldots, X_n) =
\beta_0(\sum_{i=1}^{n}a_i) + \beta_1(\sum_{i=1}^{n}a_iX_i)$. Because $\tilde{\beta}_1$ is conditionally unbiased by assumption, it must be
that $\beta_0(\sum_{i=1}^{n}a_i) + \beta_1(\sum_{i=1}^{n}a_iX_i) = \beta_1$, but for this equality to hold for all values of $\beta_0$ and $\beta_1$,
it must be the case that, for $\tilde{\beta}_1$ to be conditionally unbiased,

$$\sum_{i=1}^{n}a_i = 0 \quad \text{and} \quad \sum_{i=1}^{n}a_iX_i = 1. \qquad (5.35)$$
Under the Gauss-Markov conditions, the variance of $\tilde{\beta}_1$, conditional on $X_1, \ldots, X_n$, has a simple
form. Substituting Equation (5.35) into Equation (5.34) yields $\tilde{\beta}_1 - \beta_1 = \sum_{i=1}^{n}a_iu_i$. Thus
$\mathrm{var}(\tilde{\beta}_1 \mid X_1, \ldots, X_n) = \mathrm{var}(\sum_{i=1}^{n}a_iu_i \mid X_1, \ldots, X_n) = \sum_{i=1}^{n}\sum_{j=1}^{n}a_ia_j\,\mathrm{cov}(u_i, u_j \mid X_1, \ldots, X_n)$;
applying the second and third Gauss-Markov conditions, the cross terms in the double summation
vanish, and the expression for the conditional variance simplifies to

$$\mathrm{var}(\tilde{\beta}_1 \mid X_1, \ldots, X_n) = \sigma_u^2\sum_{i=1}^{n}a_i^2. \qquad (5.36)$$

Note that Equations (5.35) and (5.36) apply to $\hat{\beta}_1$ with weights $a_i = \hat{a}_i$ given in Equation (5.32).
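We now show that the two restrictions in Equation (5.35) and the expression for the
conditional variance in Equation (5.36) imply that the conditional variance of $\tilde{\beta}_1$ exceeds the
conditional variance of $\hat{\beta}_1$ unless $\tilde{\beta}_1 = \hat{\beta}_1$. Let $a_i = \hat{a}_i + d_i$, so $\sum_{i=1}^{n}a_i^2 = \sum_{i=1}^{n}(\hat{a}_i + d_i)^2 =
\sum_{i=1}^{n}\hat{a}_i^2 + 2\sum_{i=1}^{n}\hat{a}_id_i + \sum_{i=1}^{n}d_i^2$. Using the definition of $\hat{a}_i$ in Equation (5.32), we have that

$$\sum_{i=1}^{n}\hat{a}_id_i = \frac{\sum_{i=1}^{n}(X_i - \bar{X})d_i}{\sum_{j=1}^{n}(X_j - \bar{X})^2} = \frac{\sum_{i=1}^{n}d_iX_i - \bar{X}\sum_{i=1}^{n}d_i}{\sum_{j=1}^{n}(X_j - \bar{X})^2}$$

$$= \frac{\left(\sum_{i=1}^{n}a_iX_i - \sum_{i=1}^{n}\hat{a}_iX_i\right) - \bar{X}\left(\sum_{i=1}^{n}a_i - \sum_{i=1}^{n}\hat{a}_i\right)}{\sum_{j=1}^{n}(X_j - \bar{X})^2} = 0,$$

where the penultimate equality follows from $d_i = a_i - \hat{a}_i$ and the final equality follows from
Equation (5.35) (which holds for both $a_i$ and $\hat{a}_i$). Thus $\sigma_u^2\sum_{i=1}^{n}a_i^2 = \sigma_u^2\sum_{i=1}^{n}\hat{a}_i^2 +
\sigma_u^2\sum_{i=1}^{n}d_i^2 = \mathrm{var}(\hat{\beta}_1 \mid X_1, \ldots, X_n) + \sigma_u^2\sum_{i=1}^{n}d_i^2$; substituting this result into Equation (5.36)
yields

$$\mathrm{var}(\tilde{\beta}_1 \mid X_1, \ldots, X_n) - \mathrm{var}(\hat{\beta}_1 \mid X_1, \ldots, X_n) = \sigma_u^2\sum_{i=1}^{n}d_i^2. \qquad (5.37)$$

Thus $\tilde{\beta}_1$ has a greater conditional variance than $\hat{\beta}_1$ if $d_i$ is nonzero for any $i = 1, \ldots, n$. But if
$d_i = 0$ for all $i$, then $a_i = \hat{a}_i$ and $\tilde{\beta}_1 = \hat{\beta}_1$, which proves that OLS is BLUE.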
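
The key step of the proof, that the cross term $\sum \hat{a}_id_i$ vanishes so that $\sum a_i^2 = \sum \hat{a}_i^2 + \sum d_i^2$, can be checked on a small example. This is only a numerical illustration under assumed values, not part of the proof: the regressor values and the perturbation `d` are hypothetical, with `d` chosen so that the alternative weights still satisfy the two unbiasedness restrictions.

```python
# Hypothetical regressor values.
x = [0.0, 1.0, 2.0, 3.0]
n = len(x)
x_bar = sum(x) / n
sxx = sum((xj - x_bar) ** 2 for xj in x)

# OLS weights: a_hat_i = (X_i - X_bar) / sum_j (X_j - X_bar)^2.
a_hat = [(xi - x_bar) / sxx for xi in x]

# Hypothetical perturbation with sum(d) = 0 and sum(d * x) = 0, so that the
# alternative weights a = a_hat + d also satisfy sum(a) = 0 and sum(a * x) = 1.
d = [0.1, -0.2, 0.1, 0.0]
a = [ah + di for ah, di in zip(a_hat, d)]

sum_a = sum(a)
sum_ax = sum(ai * xi for ai, xi in zip(a, x))
cross = sum(ah * di for ah, di in zip(a_hat, d))   # cross term, should be 0
sum_a_sq = sum(ai ** 2 for ai in a)
sum_a_hat_sq = sum(ah ** 2 for ah in a_hat)
sum_d_sq = sum(di ** 2 for di in d)

# Excess conditional variance of the alternative estimator, in units of sigma_u^2.
excess_variance = sum_a_sq - sum_a_hat_sq
print(cross, excess_variance, sum_d_sq)
```

Any nonzero perturbation that keeps the estimator linear and conditionally unbiased raises the conditional variance by $\sigma_u^2\sum d_i^2$, which is what the printed values show for this one case.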


The Gauss-Markov Theorem When X Is Nonrandom 


With a minor change in interpretation, the Gauss-Markov theorem also applies to nonrandom
regressors; that is, it applies to regressors that do not change their values over repeated
samples. Specifically, if the second least squares assumption is replaced by the assumption that
$X_1, \ldots, X_n$ are nonrandom (fixed over repeated samples) and $u_1, \ldots, u_n$ are i.i.d., then the
foregoing statement and proof of the Gauss-Markov theorem apply directly, except that all of
the “conditional on $X_1, \ldots, X_n$” statements are unnecessary because $X_1, \ldots, X_n$ take on the
same values from one sample to the next.


The Sample Average Is the Efficient Linear 
Estimator of E(Y) 


An implication of the Gauss-Markov theorem is that the sample average, $\bar{Y}$, is the most efficient
linear estimator of $E(Y_i)$ when $Y_1, \ldots, Y_n$ are i.i.d. To see this, consider the case of
regression without an “X,” so that the only regressor is the constant regressor $X_{0i} = 1$. Then
the OLS estimator $\hat{\beta}_0 = \bar{Y}$. It follows that, under the Gauss-Markov assumptions, $\bar{Y}$ is BLUE.
Note that the Gauss-Markov requirement that the error be homoskedastic is automatically
satisfied in this case because there is no regressor, so it follows that $\bar{Y}$ is BLUE if $Y_1, \ldots, Y_n$
are i.i.d. This result was stated previously in Key Concept 3.3.
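
A small numerical check makes this concrete. For a linear estimator $\sum a_iY_i$ of $E(Y)$ with i.i.d. observations, unbiasedness requires $\sum a_i = 1$ and the variance is proportional to $\sum a_i^2$. The sketch below compares the equal weights $1/n$ of the sample average with one hypothetical alternative; the perturbation values are arbitrary and sum to zero.

```python
n = 4
equal_weights = [1.0 / n] * n            # weights of the sample average

# Hypothetical perturbation that sums to zero, so the alternative weights
# still sum to one and the alternative estimator remains unbiased.
d = [0.1, -0.1, 0.05, -0.05]
other_weights = [w + di for w, di in zip(equal_weights, d)]

# Variance of each estimator divided by var(Y).
var_factor_mean = sum(w ** 2 for w in equal_weights)    # equals 1/n
var_factor_other = sum(w ** 2 for w in other_weights)   # 1/n plus sum of d_i^2
print(var_factor_mean, var_factor_other)
```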


CHAPTER 6

Linear Regression with Multiple Regressors


Chapter 5 ended on a worried note. Although school districts with lower
student-teacher ratios tend to have higher test scores in the California data set,
perhaps students from districts with small classes have other advantages that help
them perform well on standardized tests. Could this have produced a misleading
estimate of the causal effect of class size on test scores, and, if so, what can be done?

Omitted factors, such as student characteristics, can, in fact, make the ordinary
least squares (OLS) estimator of the effect of class size on test scores misleading or,
more precisely, biased. This chapter explains this “omitted variable bias” and introduces
multiple regression, a method that can eliminate omitted variable bias. The key
idea of multiple regression is that if we have data on these omitted variables, then we
can include them as additional regressors and thereby estimate the causal effect of
one regressor (the student-teacher ratio) while holding constant the other variables
(such as student characteristics).

Alternatively, if one is interested not in causal inference but in prediction, the
multiple regression model makes it possible to use multiple variables as regressors (that
is, multiple predictors) to improve upon predictions made using a single regressor.

This chapter explains how to estimate the coefficients of the multiple linear
regression model. Many aspects of multiple regression parallel those of regression
with a single regressor, studied in Chapters 4 and 5. The coefficients of the multiple
regression model can be estimated from data using OLS; the OLS estimators in
multiple regression are random variables because they depend on data from a random
sample; and in large samples, the sampling distributions of the OLS estimators are
approximately normal.


6.1 Omitted Variable Bias


By focusing only on the student-teacher ratio, the empirical analysis in Chapters 4 
and 5 ignored some potentially important determinants of test scores by collecting 
their influences in the regression error term. These omitted factors include school 
characteristics, such as teacher quality and computer usage, and student characteristics, 
such as family background. We begin by considering an omitted student characteris- 
tic that is particularly relevant in California because of its large immigrant popula- 
tion: the prevalence in the school district of students who are still learning English. 

By ignoring the percentage of English learners in the district, the OLS estimator
of the effect on test scores of the student-teacher ratio could be biased; that is, the
mean of the sampling distribution of the OLS estimator might not equal the true causal
effect on test scores of a unit change in the student-teacher ratio. Here is the reasoning.
Students who are still learning English might perform worse on standardized tests than 
native English speakers. If districts with large classes also have many students still 
learning English, then the OLS regression of test scores on the student-teacher ratio 
could erroneously find a correlation and produce a large estimated coefficient, when 
in fact the true causal effect of cutting class sizes on test scores is small, even zero. 
Accordingly, based on the analysis of Chapters 4 and 5, the superintendent might hire 
enough new teachers to reduce the student-teacher ratio by 2, but her hoped-for 
improvement in test scores will fail to materialize if the true coefficient is small or zero. 

A look at the California data lends credence to this concern. The correlation 
between the student-teacher ratio and the percentage of English learners (students 
who are not native English speakers and who have not yet mastered English) in the 
district is 0.19. This small but positive correlation suggests that districts with more 
English learners tend to have a higher student-teacher ratio (larger classes). If the 
student-teacher ratio were unrelated to the percentage of English learners, then it 
would be safe to ignore English proficiency in the regression of test scores against 
the student-teacher ratio. But because the student-teacher ratio and the percentage 
of English learners are correlated, it is possible that the OLS coefficient in the regres- 
sion of test scores on the student-teacher ratio reflects that influence. 


Definition of Omitted Variable Bias 


If the regressor (the student-teacher ratio) is correlated with a variable that has been omit- 
ted from the analysis (the percentage of English learners) and that determines, in part, the 
dependent variable (test scores), then the OLS estimator will have omitted variable bias. 
Omitted variable bias occurs when two conditions are true: (1) the omitted variable 
is correlated with the included regressor and (2) the omitted variable is a determinant of 
the dependent variable. To illustrate these conditions, consider three examples of vari- 
ables that are omitted from the regression of test scores on the student-teacher ratio. 


Example 1: Percentage of English learners. Because the percentage of English 
learners is correlated with the student-teacher ratio, the first condition for omitted 
variable bias holds. It is plausible that students who are still learning English will do 
worse on standardized tests than native English speakers, in which case the percent- 
age of English learners is a determinant of test scores and the second condition for 
omitted variable bias holds. Thus the OLS estimator in the regression of test scores on 
the student-teacher ratio could incorrectly reflect the influence of the omitted variable, 
the percentage of English learners. That is, omitting the percentage of English learners 
may introduce omitted variable bias. 


Example 2: Time of day of the test. Another variable omitted from the analysis is
the time of day that the test was administered. For this omitted variable, it is plausible
that the first condition for omitted variable bias does not hold but that the second
condition does. If the time of day of the test varies from one district to the next in a
way that is unrelated to class size, then the time of day and class size would be uncorrelated,
so the first condition does not hold. Conversely, the time of day of the test
could affect scores (alertness varies through the school day), so the second condition
holds. However, because in this example the time of day the test is administered is
uncorrelated with the student-teacher ratio, the student-teacher ratio could not be
incorrectly picking up the “time of day” effect. Thus omitting the time of day of the
test does not result in omitted variable bias.


KEY CONCEPT 6.1
Omitted Variable Bias in Regression with a Single Regressor

Omitted variable bias is the bias in the OLS estimator of the causal effect of X
on Y that arises when the regressor, X, is correlated with an omitted variable. For
omitted variable bias to occur, two conditions must be true:

1. X is correlated with the omitted variable.

2. The omitted variable is a determinant of the dependent variable, Y.


Example 3: Parking lot space per pupil. Another omitted variable is parking lot 
space per pupil (the area of the teacher parking lot divided by the number of stu- 
dents). This variable satisfies the first but not the second condition for omitted vari- 
able bias. Specifically, schools with more teachers per pupil probably have more 
teacher parking space, so the first condition would be satisfied. However, under the 
assumption that learning takes place in the classroom, not the parking lot, parking 
lot space has no direct effect on learning; thus the second condition does not hold. 
Because parking lot space per pupil is not a determinant of test scores, omitting it 
from the analysis does not lead to omitted variable bias. 
Omitted variable bias is summarized in Key Concept 6.1. 


Omitted variable bias and the first least squares assumption. Omitted variable bias
means that the first least squares assumption for causal inference, that $E(u_i \mid X_i) = 0$
(as listed in Key Concept 4.3), does not hold. To see why, recall that the error term $u_i$ in the
linear regression model with a single regressor represents all factors, other than $X_i$, that are
determinants of $Y_i$. If one of these other factors is correlated with $X_i$, this means that the
error term (which contains this factor) is correlated with $X_i$. In other words, if an omitted
variable is a determinant of $Y_i$, then it is in the error term, and if it is correlated with $X_i$, then
the error term is correlated with $X_i$. Because $u_i$ and $X_i$ are correlated, the conditional
mean of $u_i$ given $X_i$ is nonzero. This correlation therefore violates the first least squares
assumption, and the consequence is serious: The OLS estimator is biased. This bias
does not vanish even in very large samples, and the OLS estimator is inconsistent.


A Formula for Omitted Variable Bias


The discussion of the previous section about omitted variable bias can be summarized
mathematically by a formula for this bias. Let the correlation between $X_i$ and $u_i$ be
$\mathrm{corr}(X_i, u_i) = \rho_{Xu}$. Suppose that the second and third least squares assumptions
hold, but the first does not because $\rho_{Xu}$ is nonzero. Then the OLS estimator has the
limit (derived in Appendix 6.1)

$$\hat{\beta}_1 \xrightarrow{p} \beta_1 + \rho_{Xu}\frac{\sigma_u}{\sigma_X}. \tag{6.1}$$

That is, as the sample size increases, $\hat{\beta}_1$ is close to $\beta_1 + \rho_{Xu}(\sigma_u/\sigma_X)$ with increasingly
high probability.

The formula in Equation (6.1) summarizes several of the ideas discussed above 
about omitted variable bias: 


1. Omitted variable bias is a problem whether the sample size is large or small.
Because $\hat{\beta}_1$ does not converge in probability to the true value $\beta_1$, $\hat{\beta}_1$ is biased
and inconsistent; that is, $\hat{\beta}_1$ is not a consistent estimator of $\beta_1$ when there is
omitted variable bias. The term $\rho_{Xu}(\sigma_u/\sigma_X)$ in Equation (6.1) is the bias in $\hat{\beta}_1$
that persists even in large samples.


Is Coffee Good for Your Health? 


A study published in the Annals of Internal Medicine (Gunter, Murphy, Cross, et al. 2017)
suggested that drinking coffee is linked to a lower risk of disease or death.¹ This study
was based on examining 521,330 participants for a mean period of 16 years in 10 European
countries. From this sample group, 41,693 deaths were recorded during this period. Another
recent study published in The Journal of the American Medical Association (Loftfield,
Cornelis, Caporaso, et al. 2018) investigated the link between heavy intake of coffee and
risk of mortality. It suggested that drinking six to seven cups of coffee per day was
associated with a 16% lower risk of death.² This study attracted substantial attention in
the U.K. press, with articles bearing headlines such as “Six coffees a day could save your
life” and “Have another cup of coffee! Six cups a day could decrease your risk of early
death by up to 16%, National Cancer Institute study finds.”³

Are these headlines accurate? Perhaps not. While they suggest a causal relationship
between coffee and life expectancy, there is the potential for omitted variable bias to
influence the relationship being established. Reviews of this study, including those by the
United Kingdom’s National Health Service (NHS) and the BMJ,⁴ note that some people may
opt not to drink coffee if they know they have an illness already. Similarly, coffee can be
considered as a surrogate endpoint for factors that affect health (income, education, or
deprivation) that may confound the observed beneficial associations and introduce errors.

According to a paper published in BMJ (Poole, Kennedy, Roderick, et al. 2017),
randomized controlled trials (RCTs), or randomized controlled experiments, allow for many
of these errors to be removed. In this case, removing the ability of people to select whether
they should drink coffee and how much they should consume would remove any omitted
variable bias arising from differences in income or in expectations about health among
coffee drinkers and non-coffee drinkers.

Sometimes, however, there may be neither a genuine relationship that an RCT could
detect, nor even an omitted variable responsible for the relationship. The website
“Spurious Correlations”⁵ details many such examples. For instance, the per capita
consumption of mozzarella cheese over time shows a strong, and coincidental, relationship
with the award of civil engineering doctorates. Be careful when interpreting the results of
regressions!


¹See the study by Gunter, Murphy, Cross, et al., “Coffee Drinking and Mortality in 10
European Countries: A Multinational Cohort Study,” Annals of Internal Medicine,
http://annals.org, July 11, 2017.

²Read the paper “Association of Coffee Drinking With Mortality by Genetic Variation in
Caffeine Metabolism, Findings From the UK Biobank,” by Loftfield, Cornelis, Caporaso,
et al., published in JAMA Internal Medicine, July 2, 2018.

³Laura Donnelly, “Six Coffees a Day Could Save Your Life,” The Telegraph, July 2, 2018,
https://www.telegraph.co.uk; and Mary Kekatos, “Have Another Cup of Coffee! Six Cups a
Day Could Decrease Your Risk of Early Death by up to 16%, National Cancer Institute Study
Finds,” The Daily Mail, July 2, 2018.

⁴For further reading, see “Another Study Finds Coffee Might Reduce Risk of Premature
Death,” on the NHS website; and “Coffee Consumption and Health: Umbrella Review of
Meta-analyses of Multiple Health Outcomes,” by Robin Poole, Oliver J Kennedy, Paul
Roderick, Jonathan A. Fallowfield, Peter C Hayes, and Julie Parkes, published on the
British Medical Journal (BMJ) website, October 16, 2017,
http://dx.doi.org/10.1136/bmj.j5024.


⁵For further information, see Spurious Correlations,
http://www.tylervigen.com/spurious-correlations.


2. Whether this bias is large or small in practice depends on the correlation
$\rho_{Xu}$ between the regressor and the error term. The larger $|\rho_{Xu}|$ is, the
larger the bias.

3. The direction of the bias in $\hat{\beta}_1$ depends on whether $X$ and $u$ are positively or
negatively correlated. For example, we speculated that the percentage of students
learning English has a negative effect on district test scores (students still
learning English have lower scores), so that the percentage of English learners
enters the error term with a negative sign. In our data, the fraction of English
learners is positively correlated with the student-teacher ratio (districts
with more English learners have larger classes). Thus the student-teacher
ratio ($X$) would be negatively correlated with the error term ($u$), so $\rho_{Xu} < 0$
and the coefficient on the student-teacher ratio $\hat{\beta}_1$ would be biased toward a
negative number. In other words, having a small percentage of English learners
is associated with both high test scores and low student-teacher ratios, so
one reason that the OLS estimator suggests that small classes improve test
scores may be that the districts with small classes have fewer English learners.
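
The limit in Equation (6.1) can be checked numerically. The following is a minimal sketch in Python with NumPy that uses simulated data, not the California data set. All parameter values (the true slope of 2, the effect of the omitted variable, and the strength of its correlation with the regressor) are illustrative assumptions chosen so that the omitted variable lowers Y and is positively correlated with X, as in the English learner example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                      # large sample, so the estimate is near its limit

beta0, beta1 = 1.0, 2.0          # assumed true intercept and causal effect of X
gamma = -3.0                     # assumed effect of the omitted variable W on Y
delta = 0.5                      # assumed strength of the link between X and W

# X is correlated with W (condition 1) and W is a determinant of Y (condition 2).
w = rng.normal(size=n)
x = delta * w + rng.normal(size=n)
e = rng.normal(size=n)           # remaining error, unrelated to X

# The error term of the regression of Y on X alone contains the omitted W.
u = gamma * w + e
y = beta0 + beta1 * x + u

# OLS slope in the single-regressor model: sample cov(X, Y) / sample var(X).
beta1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Population quantities implied by the simulation design.
sigma_x = np.sqrt(delta**2 + 1.0)
sigma_u = np.sqrt(gamma**2 + 1.0)
rho_xu = gamma * delta / (sigma_x * sigma_u)

# Limit from Equation (6.1): beta1 + rho_Xu * (sigma_u / sigma_X).
limit = beta1 + rho_xu * sigma_u / sigma_x

print(f"OLS estimate of beta1: {beta1_hat:.3f}")
print(f"Limit from (6.1):      {limit:.3f}")
```

Under these assumptions the limit is 2 - 1.2 = 0.8, and the OLS estimate should be close to that value rather than to the true slope of 2, no matter how large the sample is made. The bias is negative because the correlation between X and u is negative.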


Addressing Omitted Variable Bias by Dividing 
the Data into Groups 


What can you do about omitted variable bias? In the test score example, class size is 
correlated with the fraction of English learners. One way to address this problem is 
to select a subset of districts that have the same fraction of English learners but have 
different class sizes: For that subset of districts, class size cannot be picking up the 
English learner effect because the fraction of English learners is held constant. More 
generally, this observation suggests estimating the effect of the student-teacher ratio 
on test scores, holding constant the percentage of English learners. 

Table 6.1 reports evidence on the relationship between class size and test scores within
districts with comparable percentages of English learners. Districts are divided into eight
groups. First, the districts are broken into four categories that correspond to the quartiles
of the distribution of the percentage of English learners across districts. Second, within each
of these four categories, districts are further broken down into two groups, depending on
whether the student-teacher ratio is small ($STR < 20$) or large ($STR \ge 20$).


TABLE 6.1  Differences in Test Scores for California School Districts with Low and High
           Student-Teacher Ratios, by the Percentage of English Learners in the District

                                 Student-Teacher      Student-Teacher      Difference in Test Scores,
                                 Ratio < 20           Ratio ≥ 20           Low vs. High Student-Teacher Ratio
                                 Average              Average
                                 Test Score      n    Test Score      n    Difference    t-statistic

All districts                    657.4         238    650.0         182       7.4           4.04
Percentage of English learners
  < 1.9%                         664.5          76    665.4          27      -0.9          -0.30
  1.9-8.8%                       665.2          64    661.8          44       3.3           1.13
  8.8-23.0%                      654.9          54    649.7          50       5.2           1.72
  > 23.0%                        636.7          44    634.8          61       1.9           0.68

The first row in Table 6.1 reports the overall difference in average test scores
between districts with low and high student-teacher ratios, that is, the difference in
test scores between these two groups without breaking them down further into the
quartiles of English learners. (Recall that this difference was previously reported in
regression form in Equation (5.18) as the OLS estimate of the coefficient on $D_i$ in the
regression of TestScore on $D_i$, where $D_i$ is a binary regressor that equals 1 if $STR_i < 20$
and equals 0 otherwise.) Over the full sample of 420 districts, the average test score
is 7.4 points higher in districts with a low student-teacher ratio than a high one; the
t-statistic is 4.04, so the null hypothesis that the mean test score is the same in the two
groups is rejected at the 1% significance level.

The final four rows in Table 6.1 report the difference in test scores between districts 
with low and high student-teacher ratios, broken down by the quartile of the percentage 
of English learners. This evidence presents a different picture. Of the districts with the fewest
English learners (< 1.9%), the average test score for those 76 with low student-teacher
ratios is 664.5, and the average for the 27 with high student-teacher ratios is 665.4.
Thus, for the districts with the fewest English learners, test scores were, on average, 0.9 
points lower in the districts with low student-teacher ratios! In the second quartile, districts 
with low student-teacher ratios had test scores that averaged 3.3 points higher than those 
with high student-teacher ratios; this gap was 5.2 points for the third quartile and only 1.9 
points for the quartile of districts with the most English learners. Once we hold the percent- 
age of English learners constant, the difference in performance between districts with high 
and low student-teacher ratios is perhaps half (or less) of the overall estimate of 7.4 points.

At first, this finding might seem puzzling. How can the overall effect on test scores be
twice the effect on test scores within any quartile? The answer is that the districts with the
most English learners tend to have both the highest student-teacher ratios and the lowest
test scores. The difference in the average test scores between districts in the lowest and
highest quartiles of the percentage of English learners is large, approximately 30 points.
The districts with few English learners tend to have lower student-teacher ratios: 74%
(76 of 103) of the districts in the first quartile of English learners have small classes
($STR < 20$), while only 42% (44 of 105) of the districts in the quartile with the most
English learners have small classes. So the districts with the most English learners have
both lower test scores and higher student-teacher ratios than the other districts.
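
The arithmetic behind this explanation can be reproduced from the rounded entries of Table 6.1. The sketch below, in Python, rebuilds the overall means as count-weighted averages of the quartile means and then averages the within-quartile gaps. Weighting those gaps by each quartile's total number of districts is an assumption made here for illustration; it is one possible summary, not a calculation taken from the text. Because the inputs are rounded, the second-quartile gap computes as 3.4 rather than the 3.3 reported in the table.

```python
# Rows of Table 6.1: (quartile label, mean score and n for STR < 20,
#                     mean score and n for STR >= 20).
rows = [
    ("< 1.9%",    664.5, 76, 665.4, 27),
    ("1.9-8.8%",  665.2, 64, 661.8, 44),
    ("8.8-23.0%", 654.9, 54, 649.7, 50),
    ("> 23.0%",   636.7, 44, 634.8, 61),
]

def weighted_mean(values, weights):
    """Average of values with the given weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Overall means are the quartile means weighted by the number of districts.
low_mean = weighted_mean([r[1] for r in rows], [r[2] for r in rows])
high_mean = weighted_mean([r[3] for r in rows], [r[4] for r in rows])
overall_gap = low_mean - high_mean

# Within-quartile gaps, averaged with each quartile's total number of
# districts as the weight, so the mix of English learners is held fixed.
gaps = [r[1] - r[3] for r in rows]
sizes = [r[2] + r[4] for r in rows]
within_gap = weighted_mean(gaps, sizes)

# Share of districts with small classes (STR < 20) in each quartile.
small_share = [r[2] / (r[2] + r[4]) for r in rows]

print(f"Low-STR mean {low_mean:.1f}, high-STR mean {high_mean:.1f}")
print(f"Overall gap {overall_gap:.1f}, average within-quartile gap {within_gap:.1f}")
print("Share with STR < 20 by quartile:", [f"{s:.0%}" for s in small_share])
```

The reconstructed means are 657.4 and 650.0, which reproduces the overall gap of 7.4 points, while the average within-quartile gap is about 2.4 points. The difference arises because small classes are concentrated in the quartiles with few English learners (74% in the first quartile against 42% in the last), and those quartiles also have the highest scores.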

This analysis reinforces the superintendent’s worry that omitted variable bias is pres- 
ent in the regression of test scores against the student-teacher ratio. By looking within 
quartiles of the percentage of English learners, the test score differences in the second part 
of Table 6.1 improve on the simple difference-of-means analysis in the first line of Table 6.1. 
Still, this analysis does not yet provide the superintendent with a useful estimate of the 
effect on test scores of changing class size, holding constant the fraction of English learners. 
Such an estimate can be provided, however, using the method of multiple regression. 


The Multiple Regression Model 


The multiple regression model extends the single variable regression model of Chapters 4
and 5 to include additional variables as regressors. When used for causal inference, this
model permits estimating the effect on $Y_i$ of changing one variable ($X_{1i}$) while holding
the other regressors ($X_{2i}$, $X_{3i}$, and so forth) constant. In the class size problem, the
multiple regression model provides a way to isolate the effect on test scores ($Y_i$) of the
student-teacher ratio ($X_{1i}$) while holding constant the percentage of students in the
district who are English learners ($X_{2i}$). When used for prediction, the multiple regression
model can improve predictions by using multiple variables as predictors.

As in Chapter 4, we introduce the terminology and statistics of multiple regres- 
sion in the context of prediction. Section 6.5 returns to causal inference and formal- 
izes the requirements for multiple regression to eliminate omitted variable bias in the 
estimation of a causal effect. 


The Population Regression Line 


Suppose for the moment that there are only two independent variables, $X_{1i}$ and $X_{2i}$.
In the linear multiple regression model, the average relationship between these two
independent variables and the dependent variable, $Y$, is given by the linear function

$$E(Y_i \mid X_{1i} = x_1, X_{2i} = x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \tag{6.2}$$

where $E(Y_i \mid X_{1i} = x_1, X_{2i} = x_2)$ is the conditional expectation of $Y_i$ given that
$X_{1i} = x_1$ and $X_{2i} = x_2$. That is, if the student-teacher ratio in the $i$th district ($X_{1i}$)
equals some value $x_1$ and the percentage of English learners in the $i$th district ($X_{2i}$)
equals $x_2$, then the expected value of $Y_i$ given the student-teacher ratio and the
percentage of English learners is given by Equation (6.2).




Equation (6.2) is the population regression line or population regression function
in the multiple regression model. The coefficient $\beta_0$ is the intercept; the coefficient $\beta_1$
is the slope coefficient of $X_{1i}$ or, more simply, the coefficient on $X_{1i}$; and the
coefficient $\beta_2$ is the slope coefficient of $X_{2i}$ or, more simply, the coefficient on $X_{2i}$.

The interpretation of the coefficient $\beta_1$ in Equation (6.2) is different than it was when
$X_{1i}$ was the only regressor: In Equation (6.2), $\beta_1$ is the predicted difference in $Y$ between
two observations with a unit difference in $X_1$, holding $X_2$ constant or controlling for $X_2$.

This interpretation of $\beta_1$ follows from comparing the predictions (conditional
expectations) for two observations with the same value of $X_2$ but with values of $X_1$
that differ by $\Delta X_1$, so that the first observation has $X$ values $(X_1, X_2)$ and the second
observation has $X$ values $(X_1 + \Delta X_1, X_2)$. For the first observation, the predicted
value of $Y$ is given by Equation (6.2); write this as $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$. For the
second observation, the predicted value of $Y$ is $Y + \Delta Y$, where

$$Y + \Delta Y = \beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 X_2. \tag{6.3}$$

An equation for $\Delta Y$ in terms of $\Delta X_1$ is obtained by subtracting the equation
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ from Equation (6.3), yielding $\Delta Y = \beta_1 \Delta X_1$. Rearranging this
equation shows that

$$\beta_1 = \frac{\Delta Y}{\Delta X_1} \text{ holding } X_2 \text{ constant}. \tag{6.4}$$

Thus the coefficient $\beta_1$ is the difference in the predicted values of $Y$ (the difference
in the conditional expectations of $Y$) between two observations with a unit difference
in $X_1$, holding $X_2$ fixed. Another term used to describe $\beta_1$ is the partial effect on $Y$ of
$X_1$, holding $X_2$ fixed.

The interpretation of the intercept in the multiple regression model, $\beta_0$, is similar
to the interpretation of the intercept in the single-regressor model: It is the expected
value of $Y_i$ when $X_{1i}$ and $X_{2i}$ are 0. Simply put, the intercept $\beta_0$ determines how far up
the $Y$ axis the population regression line starts.


The Population Multiple Regression Model 


The population regression line in Equation (6.2) is the relationship between $Y$ and $X_1$ and
$X_2$ that holds, on average, in the population. Just as in the case of regression with a single
regressor, however, this relationship does not hold exactly because many other factors
influence the dependent variable. In addition to the student-teacher ratio and the fraction
of students still learning English, for example, test scores are influenced by school
characteristics, other student characteristics, and luck. Thus the population regression function
in Equation (6.2) needs to be augmented to incorporate these additional factors.

Just as in the case of regression with a single regressor, the factors that determine
$Y_i$ in addition to $X_{1i}$ and $X_{2i}$ are incorporated into Equation (6.2) as an “error” term
$u_i$. Accordingly, we have

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i, \quad i = 1, \ldots, n, \tag{6.5}$$

where the subscript $i$ indicates the $i$th of the $n$ observations (districts) in the sample.
Equation (6.5) is the population multiple regression model when there are two
regressors, $X_{1i}$ and $X_{2i}$.
It can be useful to treat $\beta_0$ as the coefficient on a regressor that always equals 1;
think of $\beta_0$ as the coefficient on $X_{0i}$, where $X_{0i} = 1$ for $i = 1, \ldots, n$. Accordingly, the
population multiple regression model in Equation (6.5) can alternatively be written as

$$Y_i = \beta_0 X_{0i} + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i, \text{ where } X_{0i} = 1, \quad i = 1, \ldots, n. \tag{6.6}$$

The variable $X_{0i}$ is sometimes called the constant regressor because it takes on the
same value (the value 1) for all observations. Similarly, the intercept, $\beta_0$, is
sometimes called the constant term in the regression.

The two ways of writing the population regression model, Equations (6.5) and
(6.6), are equivalent.

The discussion so far has focused on the case of a single additional variable, $X_2$.
In applications, it is common to have more than two regressors. This reasoning leads
us to consider a model that includes $k$ regressors. The multiple regression model with
$k$ regressors, $X_{1i}, X_{2i}, \ldots, X_{ki}$, is summarized as Key Concept 6.2.

The definitions of homoskedasticity and heteroskedasticity in the multiple regression
model extend their definitions in the single-regressor model. The error term $u_i$ in the
multiple regression model is homoskedastic if the variance of the conditional distribution
of $u_i$ given $X_{1i}, \ldots, X_{ki}$, $\mathrm{var}(u_i \mid X_{1i}, \ldots, X_{ki})$, is constant for $i = 1, \ldots, n$ and thus does
not depend on the values of $X_{1i}, \ldots, X_{ki}$. Otherwise, the error term is heteroskedastic.

Key Concept 6.2: The Multiple Regression Model

The multiple regression model is

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i, \quad i = 1, \ldots, n, \tag{6.7}$$

where

• $Y_i$ is the $i$th observation on the dependent variable; $X_{1i}, X_{2i}, \ldots, X_{ki}$ are the $i$th
observations on each of the $k$ regressors; and $u_i$ is the error term.

• The population regression line is the relationship that holds between $Y$ and
the $X$’s, on average, in the population:

$$E(Y \mid X_{1i} = x_1, X_{2i} = x_2, \ldots, X_{ki} = x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k.$$

• $\beta_1$ is the slope coefficient on $X_1$, $\beta_2$ is the slope coefficient on $X_2$, and so on. The
coefficient $\beta_1$ is the expected difference in $Y_i$ associated with a unit difference
in $X_1$, holding constant the other regressors, $X_2, \ldots, X_k$. The coefficients on
the other $X$’s are interpreted similarly.

• The intercept $\beta_0$ is the expected value of $Y$ when all the $X$’s equal 0. The intercept
can be thought of as the coefficient on a regressor, $X_{0i}$, that equals 1 for all $i$.


6.3


The OLS Estimator in Multiple Regression


To be of practical value, we need to estimate the unknown population coefficients
$\beta_0, \ldots, \beta_k$ using a sample of data. As in regression with a single regressor, these
coefficients can be estimated using ordinary least squares.


The OLS Estimator 


Section 4.2 shows how to estimate the intercept and slope coefficients in the
single-regressor model by applying OLS to a sample of observations of $Y$ and $X$. The key idea
is that these coefficients can be estimated by minimizing the sum of squared prediction
mistakes, that is, by choosing the estimators $b_0$ and $b_1$ so as to minimize
$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$. The estimators that do so are the OLS estimators, $\hat{\beta}_0$ and $\hat{\beta}_1$.

The method of OLS also can be used to estimate the coefficients $\beta_0, \beta_1, \ldots, \beta_k$
in the multiple regression model. Let $b_0, b_1, \ldots, b_k$ be estimates of $\beta_0, \beta_1, \ldots, \beta_k$.
The predicted value of $Y_i$, calculated using these estimates, is $b_0 + b_1 X_{1i} + \cdots + b_k X_{ki}$,
and the mistake in predicting $Y_i$ is $Y_i - (b_0 + b_1 X_{1i} + \cdots + b_k X_{ki}) =
Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki}$. The sum of these squared prediction mistakes over
all $n$ observations is thus

$$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki})^2. \tag{6.8}$$
= 


The sum of the squared mistakes for the linear regression model in Expression (6.8) is
the extension of the sum of the squared mistakes given in Equation (4.4) for the
linear regression model with a single regressor.

The estimators of the coefficients $\beta_0, \beta_1, \ldots, \beta_k$ that minimize the sum of
squared mistakes in Expression (6.8) are called the ordinary least squares (OLS)
estimators of $\beta_0, \beta_1, \ldots, \beta_k$. The OLS estimators are denoted $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$.

The terminology of OLS in the linear multiple regression model is the same as
in the linear regression model with a single regressor. The OLS regression line is the
straight line constructed using the OLS estimators: $\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k$. The
predicted value of $Y_i$ given $X_{1i}, \ldots, X_{ki}$, based on the OLS regression line, is
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \cdots + \hat{\beta}_k X_{ki}$. The OLS residual for the $i$th observation is the
difference between $Y_i$ and its OLS predicted value; that is, the OLS residual is $\hat{u}_i = Y_i - \hat{Y}_i$.

The OLS estimators could be computed by trial and error, repeatedly trying
different values of $b_0, \ldots, b_k$ until you are satisfied that you have minimized the
sum of squared mistakes in Expression (6.8). It is far easier, however, to use explicit formulas for
the OLS estimators that are derived using calculus. The formulas for the OLS estimators
in the multiple regression model are similar to those in Key Concept 4.2 for the
single-regressor model. These formulas are incorporated into modern statistical software.
In the multiple regression model, the formulas are best expressed and discussed
using matrix notation, so their presentation is deferred to Section 19.1.
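
To make the minimization concrete, the following minimal sketch in Python uses NumPy's least squares routine on simulated data with two regressors. The data-generating coefficients (5, -2, and 0.5), the sample size, and the function name `sum_squared_mistakes` are assumptions introduced for illustration; nothing here comes from the California data set.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated regressors and outcome; the coefficients 5, -2, 0.5 are assumptions.
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 5.0 - 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

# Design matrix with the constant regressor (always equal to 1) in column 0.
X = np.column_stack([np.ones(n), x1, x2])

def sum_squared_mistakes(b):
    """Expression (6.8) evaluated at trial coefficients b = (b0, b1, b2)."""
    return float(np.sum((y - X @ b) ** 2))

# OLS estimates: the coefficients that minimize the sum of squared mistakes.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat          # OLS predicted values
u_hat = y - y_hat             # OLS residuals

# A trial value away from the OLS estimates gives a larger sum.
trial = beta_hat + np.array([0.0, 0.1, -0.1])

print("OLS estimates:", np.round(beta_hat, 3))
print("Sum of squared mistakes at OLS estimates:", round(sum_squared_mistakes(beta_hat), 2))
print("Sum of squared mistakes at trial values: ", round(sum_squared_mistakes(trial), 2))
```

Replacing `trial` with any other set of coefficients should never produce a smaller sum than the one obtained at `beta_hat`, which is what trial and error would eventually discover. Because the regression includes an intercept, the residuals in `u_hat` sum to zero up to rounding error.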

The definitions and terminology of OLS in multiple regression are summarized
in Key Concept 6.3.


Key Concept 6.3: The OLS Estimators, Predicted Values, and Residuals
in the Multiple Regression Model

The OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are the values of $b_0, b_1, \ldots, b_k$ that minimize
the sum of squared prediction errors $\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki})^2$. The
OLS predicted values $\hat{Y}_i$ and residuals $\hat{u}_i$ are

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \cdots + \hat{\beta}_k X_{ki}, \quad i = 1, \ldots, n, \text{ and} \tag{6.9}$$

$$\hat{u}_i = Y_i - \hat{Y}_i, \quad i = 1, \ldots, n. \tag{6.10}$$

The OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ and residuals $\hat{u}_i$ are computed from a sample
of $n$ observations of $(X_{1i}, \ldots, X_{ki}, Y_i)$, $i = 1, \ldots, n$. These are estimators of the
unknown true population coefficients $\beta_0, \beta_1, \ldots, \beta_k$ and error term $u_i$.


Application to Test Scores and the Student-Teacher Ratio 


In Section 4.2, we used OLS to estimate the intercept and slope coefficient of the 
regression relating test scores (TestScore) to the student-teacher ratio (STR), using 
our 420 observations for California school districts. The estimated OLS regression 
line, reported in Equation (4.9), is 


$$\widehat{TestScore} = 698.9 - 2.28 \times STR. \tag{6.11}$$


From the perspective of the father looking for a way to predict test scores, this
relation is not very satisfying: its $R^2$ is only 0.051; that is, the student-teacher ratio
explains only 5.1% of the variation in test scores. Can this prediction be made more
precise by including additional regressors?

To find out, we estimate a multiple regression with test scores as the dependent
variable ($Y_i$) and with two regressors: the student-teacher ratio ($X_{1i}$) and the
percentage of English learners in the school district ($X_{2i}$). The OLS regression line,
estimated using our 420 districts ($i = 1, \ldots, 420$), is

$$\widehat{TestScore} = 686.0 - 1.10 \times STR - 0.65 \times PctEL, \tag{6.12}$$

where PctEL is the percentage of students in the district who are English learners.
The OLS estimate of the intercept ($\hat{\beta}_0$) is 686.0, the OLS estimate of the coefficient
on the student-teacher ratio ($\hat{\beta}_1$) is $-1.10$, and the OLS estimate of the coefficient
on the percentage of English learners ($\hat{\beta}_2$) is $-0.65$.
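
As a short worked illustration of how these estimates are used, the Python sketch below evaluates the estimated regression line in Equation (6.12) for two districts that differ only in their student-teacher ratio. The district values (20 or 18 students per teacher, 10% English learners) and the function name `predicted_test_score` are hypothetical and chosen only to show the arithmetic.

```python
def predicted_test_score(str_ratio, pct_el):
    """Predicted score from the estimated line in Equation (6.12)."""
    return 686.0 - 1.10 * str_ratio - 0.65 * pct_el

# Hypothetical district: 20 students per teacher, 10% English learners.
base = predicted_test_score(20.0, 10.0)

# Same share of English learners, but two fewer students per teacher.
smaller_classes = predicted_test_score(18.0, 10.0)

print(f"Predicted score at STR = 20, PctEL = 10: {base:.1f}")
print(f"Predicted score at STR = 18, PctEL = 10: {smaller_classes:.1f}")
print(f"Difference, holding PctEL constant: {smaller_classes - base:.2f}")
```

The predictions are 657.5 and 659.7, so lowering the student-teacher ratio by 2 while holding PctEL at 10 raises the predicted score by 2 × 1.10 = 2.20 points. The same difference of 2.20 is obtained at any other fixed value of PctEL, which is what "holding constant" means in a linear model.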

The coefficient on the student-teacher ratio in the multiple regression is approximately
half as large as when the student-teacher ratio is the only regressor, $-1.10$
vs. $-2.28$. This difference occurs because the coefficient on STR in the multiple
regression holds constant (or controls for) PctEL, whereas in the single-regressor
regression, PctEL is not held constant.

The decline in the magnitude of the coefficient on the student-teacher ratio,
once one controls for PctEL, parallels the findings in Table 6.1. There we saw that, among
schools within the same quartile of percentage of English learners, the difference in test
scores between schools with a high vs. a low student-teacher ratio is less than the
difference if one does not hold constant the percentage of English learners. As in Table 6.1, this
strongly suggests that, from the perspective of causal inference, the original estimate of
the effect of the student-teacher ratio on test scores in Equation (6.11) is subject to
omitted variable bias.

Equation (6.12) provides multiple regression estimates that the father can use
for prediction, now using two predictors; we have not yet, however, answered his
question as to whether the quality of that prediction has been improved. To do so, we
need to extend the measures of fit in the single-regressor model to multiple
regression.


6.4


Measures of Fit in Multiple Regression


Three commonly used summary statistics in multiple regression are the standard
error of the regression, the regression $R^2$, and the adjusted $R^2$ (also known as $\bar{R}^2$). All
three statistics measure how well the OLS estimate of the multiple regression line
describes, or “fits,” the data.


The Standard Error of the Regression (SER) 


The standard error of the regression (SER) estimates the standard deviation of the
error term $u_i$. Thus the SER is a measure of the spread of the distribution of $Y$ around
the regression line. In multiple regression, the SER is

$$SER = s_{\hat{u}} = \sqrt{s_{\hat{u}}^2}, \text{ where } s_{\hat{u}}^2 = \frac{1}{n - k - 1} \sum_{i=1}^{n} \hat{u}_i^2 = \frac{SSR}{n - k - 1}, \tag{6.13}$$

and where SSR is the sum of squared residuals, $SSR = \sum_{i=1}^{n} \hat{u}_i^2$.

The only difference between the definition of the SER in Equation (6.13) and
the definition of the SER in Section 4.3 for the single-regressor model is that here
the divisor is $n - k - 1$ rather than $n - 2$. In Section 4.3, the divisor $n - 2$ (rather
than $n$) adjusts for the downward bias introduced by estimating two coefficients (the
slope and intercept of the regression line). Here, the divisor $n - k - 1$ adjusts for
the downward bias introduced by estimating $k + 1$ coefficients (the $k$ slope
coefficients plus the intercept). As in Section 4.3, using $n - k - 1$ rather than $n$ is called a
degrees-of-freedom adjustment. If there is a single regressor, then $k = 1$, so the
formula in Section 4.3 is the same as that in Equation (6.13). When $n$ is large, the effect
of the degrees-of-freedom adjustment is negligible.


The $R^2$


The regression $R^2$ is the fraction of the sample variance of $Y_i$ explained by (or
predicted by) the regressors. Equivalently, the $R^2$ is 1 minus the fraction of the variance
of $Y_i$ not explained by the regressors.

The mathematical definition of the $R^2$ is the same as for regression with a single
regressor:

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}, \tag{6.14}$$

where the explained sum of squares is $ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ and the total sum of
squares is $TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$.

In multiple regression, the $R^2$ increases whenever a regressor is added unless the
estimated coefficient on the added regressor is exactly 0. To see this, think about
starting with one regressor and then adding a second. When you use OLS to estimate
the model with both regressors, OLS finds the values of the coefficients that minimize
the sum of squared residuals. If OLS happens to choose the coefficient on the new
regressor to be exactly 0, then the SSR will be the same whether or not the second
variable is included in the regression. But if OLS chooses any value other than 0,
then it must be that this value reduced the SSR relative to the regression that
excludes this regressor. In practice, it is extremely unusual for an estimated
coefficient to be exactly 0, so in general the SSR will decrease when a new regressor is
added. But this means that the $R^2$ generally increases (and never decreases) when
a new regressor is added.
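
This property is easy to see in a small experiment. The sketch below, in Python with NumPy, fits a regression on simulated data with one relevant regressor and then adds a second regressor that is pure noise by construction. The data-generating values and the helper name `r_squared` are assumptions made for illustration.

```python
import numpy as np

def r_squared(y, regressors):
    """R^2 = 1 - SSR/TSS from an OLS regression of y on an intercept and regressors."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta_hat) ** 2)      # sum of squared residuals
    tss = np.sum((y - y.mean()) ** 2)          # total sum of squares
    return 1.0 - ssr / tss

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

# A regressor that is pure noise: by construction it has no relation to y.
noise = rng.normal(size=n)

r2_one = r_squared(y, [x1])
r2_two = r_squared(y, [x1, noise])

print(f"R^2 with x1 only:      {r2_one:.4f}")
print(f"R^2 with x1 and noise: {r2_two:.4f}")
```

Even though the added variable is unrelated to y, the second $R^2$ is at least as large as the first, typically by a small amount. A higher $R^2$ therefore does not by itself show that the added variable belongs in the regression.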


The Adjusted $R^2$


Because the $R^2$ increases when a new variable is added, an increase in the $R^2$ does
not mean that adding a variable actually improves the fit of the model. In this sense,
the $R^2$ gives an inflated estimate of how well the regression fits the data. One way to
correct for this is to deflate or reduce the $R^2$ by some factor, and this is what the
adjusted $R^2$, or $\bar{R}^2$, does.

The adjusted $R^2$, or $\bar{R}^2$, is a modified version of the $R^2$ that does not necessarily
increase when a new regressor is added. The $\bar{R}^2$ is

$$\bar{R}^2 = 1 - \frac{n - 1}{n - k - 1} \frac{SSR}{TSS} = 1 - \frac{s_{\hat{u}}^2}{s_Y^2}. \tag{6.15}$$

The difference between this formula and the second definition of the $R^2$ in Equation
(6.14) is that the ratio of the sum of squared residuals to the total sum of squares is
multiplied by the factor $(n - 1)/(n - k - 1)$. As the second expression in Equation (6.15)
shows, this means that the adjusted $R^2$ is 1 minus the ratio of the sample variance of the
OLS residuals [with the degrees-of-freedom correction in Equation (6.13)] to the sample
variance of $Y$.

There are three useful things to know about the $\bar{R}^2$. First, $(n - 1)/(n - k - 1)$
is always greater than 1, so $\bar{R}^2$ is always less than $R^2$.

Second, adding a regressor has two opposite effects on the $\bar{R}^2$. On the one hand,
the SSR falls, which increases the $\bar{R}^2$. On the other hand, the factor
$(n - 1)/(n - k - 1)$ increases. Whether the $\bar{R}^2$ increases or decreases depends on
which of these two effects is stronger.

Third, the $\bar{R}^2$ can be negative. This happens when the regressors, taken together,
reduce the sum of squared residuals by such a small amount that this reduction fails
to offset the factor $(n - 1)/(n - k - 1)$.
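
The three measures of fit can be computed together from the OLS residuals. The following minimal sketch in Python with NumPy applies Equations (6.13), (6.14), and (6.15) to simulated data in which the dependent variable is unrelated to five regressors; the sample size, the number of regressors, and the function name `fit_measures` are assumptions chosen to illustrate a case in which the regressors explain very little.

```python
import numpy as np

def fit_measures(y, X):
    """Return (SER, R^2, adjusted R^2); X holds the k regressors, without the constant."""
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])                # add the constant regressor
    beta_hat, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    ssr = np.sum((y - Xc @ beta_hat) ** 2)               # sum of squared residuals
    tss = np.sum((y - y.mean()) ** 2)                    # total sum of squares
    ser = np.sqrt(ssr / (n - k - 1))                     # Equation (6.13)
    r2 = 1.0 - ssr / tss                                 # Equation (6.14)
    adj_r2 = 1.0 - (n - 1) / (n - k - 1) * ssr / tss     # Equation (6.15)
    return ser, r2, adj_r2

rng = np.random.default_rng(3)
n = 30

# y is generated independently of the five regressors.
X = rng.normal(size=(n, 5))
y = rng.normal(size=n)

ser, r2, adj_r2 = fit_measures(y, X)
print(f"SER = {ser:.3f}, R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

In a setting like this one, the $R^2$ is positive simply because five coefficients were fitted to 30 observations, while the adjusted $R^2$ is lower, usually near zero, and can come out negative. Whatever the data, the adjusted value returned by this function is below the unadjusted one whenever at least one regressor is included and the residuals are not all zero.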


Application to Test Scores 


Equation (6.12) reports the estimated regression line for the multiple regression
relating test scores (TestScore) to the student-teacher ratio (STR) and the
percentage of English learners (PctEL). The $R^2$ for this regression line is
$R^2 = 0.426$, the adjusted $R^2$ is $\bar{R}^2 = 0.424$, and the standard error of the regression
is $SER = 14.5$.

Comparing these measures of fit with those for the regression in which PctEL
is excluded [Equation (5.8)] shows that including PctEL in the regression increases
the $R^2$ from 0.051 to 0.426. When the only regressor is STR, only a small fraction of
the variation in TestScore is explained; however, when PctEL is added to the
regression, more than two-fifths (42.6%) of the variation in test scores is explained. In
this sense, including the percentage of English learners substantially improves the
fit of the regression. Because $n$ is large and only two regressors appear in Equation
(6.12), the difference between $R^2$ and adjusted $R^2$ is very small ($R^2 = 0.426$ vs.
$\bar{R}^2 = 0.424$).

The SER for the regression excluding PctEL is 18.6; this value falls to 14.5 when 
PctEL is included as a second regressor. The units of the SER are points on the stan- 
dardized test. The reduction in the SER tells us that predictions about standardized test 
scores are substantially more precise if they are made using the regression with both 
STR and PctEL than if they are made using the regression with only STR as a 
regressor. 


Using the $R^2$ and adjusted $R^2$. The $R^2$ is useful because it quantifies the extent to
which the regressors account for, or explain, the variation in the dependent variable.
Nevertheless, heavy reliance on the $R^2$ (or $\bar{R}^2$) can be a trap.

In applications in which the goal is to produce reliable out-of-sample predictions,
including many regressors can produce a good in-sample fit but can degrade the
out-of-sample performance. Although the $\bar{R}^2$ improves upon the $R^2$ for this purpose,
simply maximizing the $\bar{R}^2$ still can produce poor out-of-sample forecasts. We return
to this issue in Chapter 14.
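The in-sample side of this trap can be illustrated with simulated data. The sketch below is an assumption-laden illustration, not an analysis of the test score data: the helper `fit_stats`, the sample size, and the "junk" regressors (pure noise, unrelated to y by construction) are all hypothetical.

```python
import numpy as np


def fit_stats(y, X):
    """OLS with an intercept; returns (R^2, adjusted R^2)."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])           # add the constant regressor
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]    # OLS coefficients
    resid = y - Z @ beta
    ssr = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ssr / tss
    adj = 1.0 - (n - 1) / (n - k - 1) * ssr / tss
    return r2, adj


rng = np.random.default_rng(0)        # fixed seed, synthetic data
n = 50
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
junk = rng.normal(size=(n, 20))       # regressors unrelated to y

results = []
for m in (0, 5, 10, 20):              # number of junk regressors added
    X = np.column_stack([x, junk[:, :m]])
    results.append((m, *fit_stats(y, X)))
    print(results[-1])                # (m, R^2, adjusted R^2)
```

The $R^2$ cannot fall as junk regressors are added, even though they contain no information about y. The adjusted $R^2$ penalizes the extra regressors, but it is still an in-sample measure.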

In applications in which the goal is causal inference, the decision about whether
to include a variable in a multiple regression should be based on whether including
that variable allows you to better estimate the causal effect of interest. The least
squares assumptions for causal inference in multiple regression make precise the
requirements for an included variable to eliminate omitted variable bias, and we now
turn to those assumptions.


6.5 The Least Squares Assumptions for Causal
Inference in Multiple Regression


In this section, we make precise the requirements for OLS to provide valid inferences 
about causal effects. We consider the case in which we are interested in knowing the 
causal effects of all k regressors in the multiple regression model; that is, all the
coefficients $\beta_1, \ldots, \beta_k$ are causal effects of interest. Section 6.8 presents the least squares
assumptions that apply when only some of the coefficients are causal effects, while 
the rest are coefficients on variables included to control for omitted factors and do 
not necessarily have a causal interpretation. Appendix 6.4 provides the least squares 
assumptions for prediction with multiple regression. 

There are four least squares assumptions for causal inference in the multiple 
regression model. The first three are those of Section 4.3 for the single-regressor model 
(Key Concept 4.3) extended to allow for multiple regressors, and they are discussed 
here only briefly. The fourth assumption is new and is discussed in more detail. 


Assumption 1: The Conditional Distribution of $u_i$ Given
$X_{1i}, X_{2i}, \ldots, X_{ki}$ Has a Mean of 0


The first assumption is that the conditional distribution of $u_i$ given $X_{1i}, \ldots, X_{ki}$ has a
mean of 0. This assumption extends the first least squares assumption with a single
regressor to multiple regressors. This assumption is implied if $X_{1i}, \ldots, X_{ki}$ are
randomly assigned or are as-if randomly assigned; if so, for any value of the regressors,
the expected value of $u_i$ is 0. As is the case for regression with a single regressor, this
is the key assumption that makes the OLS estimators unbiased.


Assumption 2: $(X_{1i}, X_{2i}, \ldots, X_{ki}, Y_i)$, $i = 1, \ldots, n$, Are i.i.d.

The second assumption is that $(X_{1i}, \ldots, X_{ki}, Y_i)$, $i = 1, \ldots, n$, are independently and
identically distributed (i.i.d.) random variables. This assumption holds automatically if 
the data are collected by simple random sampling. The comments on this assumption 
appearing in Section 4.3 for a single regressor also apply to multiple regressors. 


Assumption 3: Large Outliers Are Unlikely 


The third least squares assumption is that large outliers (that is, observations with
values far outside the usual range of the data) are unlikely. This assumption serves
as a reminder that, as in the single-regressor case, the OLS estimator of the
coefficients in the multiple regression model can be sensitive to large outliers.

The assumption that large outliers are unlikely is made mathematically precise by
assuming that $X_{1i}, \ldots, X_{ki}$ and $Y_i$ have nonzero finite fourth moments:
$0 < E(X_{1i}^4) < \infty, \ldots, 0 < E(X_{ki}^4) < \infty$ and $0 < E(Y_i^4) < \infty$. Another way to state
this assumption is that the dependent variable and regressors have finite kurtosis. This
assumption is used to derive the properties of OLS regression statistics in large samples.
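The sensitivity of OLS to a single large outlier can be seen in a small simulated example. This is only a sketch under stated assumptions: the data are synthetic, the helper `ols_slope` is hypothetical, and the "outlier" is one artificially mistyped observation.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]


rng = np.random.default_rng(7)                 # synthetic data
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(0.0, 0.1, size=50)    # true slope is 2

slope_clean = ols_slope(x, y)

# Append one wildly mistyped observation: y recorded as 100 instead of about 2.
x_out = np.append(x, 1.0)
y_out = np.append(y, 100.0)
slope_outlier = ols_slope(x_out, y_out)

print("slope without outlier:", round(slope_clean, 2))
print("slope with one outlier:", round(slope_outlier, 2))
```

One observation out of 51 is enough to move the estimated slope far from the value used to generate the data.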


Assumption 4: No Perfect Multicollinearity 


The fourth assumption is new to the multiple regression model. It rules out an incon- 
venient situation called perfect multicollinearity, in which it is impossible to compute 
the OLS estimator. The regressors are said to exhibit perfect multicollinearity (or to 
be perfectly multicollinear) if one of the regressors is a perfect linear function of the 
other regressors. The fourth least squares assumption is that the regressors are not 
perfectly multicollinear. 

Why does perfect multicollinearity make it impossible to compute the OLS
estimator? Suppose you want to estimate the coefficient on STR in a regression of
$TestScore_i$ on $STR_i$ and $PctEL_i$ but you make a typographical error and accidentally
type in $STR_i$ a second time instead of $PctEL_i$; that is, you regress $TestScore_i$ on $STR_i$
and $STR_i$. This is a case of perfect multicollinearity because one of the regressors (the
first occurrence of STR) is a perfect linear function of another regressor (the second
occurrence of STR). Depending on how your software package handles perfect
multicollinearity, if you try to estimate this regression, the software will do one of two
things: Either it will drop one of the occurrences of STR, or it will refuse to calculate
the OLS estimates and give an error message. The mathematical reason for this
failure is that perfect multicollinearity produces division by 0 in the OLS formulas.

At an intuitive level, perfect multicollinearity is a problem because you are ask- 
ing the regression to answer an illogical question. In multiple regression, the coeffi- 
cient on one of the regressors is the effect of a change in that regressor, holding the 
other regressors constant. In the hypothetical regression of TestScore on STR and 
STR, the coefficient on the first occurrence of STR is the effect on test scores of a 
change in STR, holding constant STR. This makes no sense, and OLS cannot estimate 
this nonsensical partial effect. 
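The duplicated-regressor case can be reproduced numerically. The sketch below uses made-up student-teacher ratios (not the data set used in the text) and NumPy's rank computation; the two coefficient vectors `b_a` and `b_b` are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)            # synthetic data, not the text's data set
n = 100
STR = rng.uniform(14, 26, size=n)         # made-up student-teacher ratios

# Regressors: constant, STR, and STR typed a second time by mistake.
X = np.column_stack([np.ones(n), STR, STR])

rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)   # rank is below the column count

# Two different coefficient vectors produce identical fitted values,
# so least squares has no basis for choosing between them.
b_a = np.array([700.0, -2.0, 0.0])
b_b = np.array([700.0, 0.5, -2.5])
same_fit = bool(np.allclose(X @ b_a, X @ b_b))
print("identical fitted values:", same_fit)
```

Any split of the total coefficient across the two copies of STR fits the data equally well, which is the numerical counterpart of asking for the effect of STR holding STR constant.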

The solution to perfect multicollinearity in this hypothetical regression is sim- 
ply to correct the typo and to replace one of the occurrences of STR with the vari- 
able you originally wanted to include. This example is typical: When perfect 
multicollinearity occurs, it often reflects a logical mistake in choosing the regres- 
sors or some previously unrecognized feature of the data set. In general, the solu- 
tion to perfect multicollinearity is to modify the regressors to eliminate the 
problem. 

Additional examples of perfect multicollinearity are given in Section 6.7, which 
also defines and discusses imperfect multicollinearity. 

The least squares assumptions for the multiple regression model are summarized
in Key Concept 6.4.

Key Concept 6.4: The Least Squares Assumptions for Causal Inference
in the Multiple Regression Model

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i, \quad i = 1, \ldots, n,$$

where $\beta_1, \ldots, \beta_k$ are causal effects and

1. $u_i$ has a conditional mean of 0 given $X_{1i}, X_{2i}, \ldots, X_{ki}$; that is,
   $E(u_i \mid X_{1i}, X_{2i}, \ldots, X_{ki}) = 0$.

2. $(X_{1i}, X_{2i}, \ldots, X_{ki}, Y_i)$, $i = 1, \ldots, n$, are independently and identically
   distributed (i.i.d.) draws from their joint distribution.

3. Large outliers are unlikely: $X_{1i}, \ldots, X_{ki}$ and $Y_i$ have nonzero finite fourth
   moments.

4. There is no perfect multicollinearity.


6.6 The Distribution of the OLS Estimators
in Multiple Regression


Because the data differ from one sample to the next, different samples produce
different values of the OLS estimators. This variation across possible samples gives rise
to the uncertainty associated with the OLS estimators of the population regression
coefficients, $\beta_0, \beta_1, \ldots, \beta_k$. Just as in the case of regression with a single regressor, this
variation is summarized in the sampling distribution of the OLS estimators.

Recall from Section 4.4 that, under the least squares assumptions, the OLS
estimators ($\hat{\beta}_0$ and $\hat{\beta}_1$) are unbiased and consistent estimators of the unknown
coefficients ($\beta_0$ and $\beta_1$) in the linear regression model with a single regressor. In addition,
in large samples, the sampling distribution of $\hat{\beta}_0$ and $\hat{\beta}_1$ is well approximated by a
bivariate normal distribution.

These results carry over to multiple regression analysis. That is, under the least
squares assumptions of Key Concept 6.4, the OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are
unbiased and consistent estimators of $\beta_0, \beta_1, \ldots, \beta_k$ in the linear multiple regression
model. In large samples, the joint sampling distribution of $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ is well
approximated by a multivariate normal distribution, which is the extension of the
bivariate normal distribution to the general case of two or more jointly normal
random variables (Section 2.4).

Although the algebra is more complicated when there are multiple regressors,
the central limit theorem applies to the OLS estimators in the multiple regression
model for the same reason that it applies to $\bar{Y}$ and to the OLS estimators when there
is a single regressor: The OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are averages of the randomly
sampled data, and if the sample size is sufficiently large, the sampling distribution of
those averages becomes normal. Because the multivariate normal distribution is best
handled mathematically using matrix algebra, the expressions for the joint
distribution of the OLS estimators are deferred to Chapter 19.

Key Concept 6.5: Large-Sample Distribution of $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$

If the least squares assumptions (Key Concept 6.4) hold, then in large samples
the OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are jointly normally distributed, and each $\hat{\beta}_j$
is distributed $N(\beta_j, \sigma^2_{\hat{\beta}_j})$, $j = 0, \ldots, k$.

Key Concept 6.5 summarizes the result that, in large samples, the distribution of
the OLS estimators in multiple regression is approximately jointly normal. In
general, the OLS estimators are correlated; this correlation arises from the correlation
between the regressors. The joint sampling distribution of the OLS estimators is
discussed in more detail for the case where there are two regressors and homoskedastic
errors in Appendix 6.2, and the general case is discussed in Section 19.2.
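A small Monte Carlo experiment illustrates these large-sample results. The following is a sketch under assumed values: the coefficients, the correlation between the regressors, the sample size, and the error distribution (uniform and heteroskedastic, so deliberately non-normal) are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 2000
beta = np.array([1.0, 2.0, -1.0])         # hypothetical beta_0, beta_1, beta_2
rho = 0.7                                 # correlation between X1 and X2

estimates = np.empty((reps, 3))
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    # Non-normal, heteroskedastic errors with conditional mean 0.
    u = rng.uniform(-1, 1, size=n) * (1 + np.abs(x1))
    X = np.column_stack([np.ones(n), x1, x2])
    y = X @ beta + u
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

mean_est = estimates.mean(axis=0)
corr_b1_b2 = np.corrcoef(estimates[:, 1], estimates[:, 2])[0, 1]

# Share of standardized beta_1 estimates within 1.96 standard deviations;
# for a normal distribution this share is about 0.95.
z = (estimates[:, 1] - estimates[:, 1].mean()) / estimates[:, 1].std()
share_within = float(np.mean(np.abs(z) < 1.96))

print("mean of estimates:", mean_est.round(3))
print("corr(beta1_hat, beta2_hat):", round(float(corr_b1_b2), 3))
print("share within 1.96 sd:", round(share_within, 3))
```

In this design the averages of the estimates sit close to the assumed coefficients, and the two slope estimators are negatively correlated because the two regressors are positively correlated.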


6.7 Multicollinearity


As discussed in Section 6.5, perfect multicollinearity arises when one of the regressors 
is a perfect linear combination of the other regressors. This section provides some 
examples of perfect multicollinearity and discusses how perfect multicollinearity can 
arise, and can be avoided, in regressions with multiple binary regressors. Imperfect
multicollinearity arises when one of the regressors is very highly correlated (but not
perfectly correlated) with the other regressors. Unlike perfect multicollinearity,
imperfect multicollinearity does not prevent estimation of the regression, nor does it imply
a logical problem with the choice of regressors. However, it does mean that one or
more regression coefficients could be estimated imprecisely.


Examples of Perfect Multicollinearity 


We continue the discussion of perfect multicollinearity from Section 6.5 by
examining three additional hypothetical regressions. In each, a third regressor is added to
the regression of $TestScore_i$ on $STR_i$ and $PctEL_i$ in Equation (6.12).


Example 1: Fraction of English learners. Let $FracEL_i$ be the fraction of English
learners in the $i$th district, which varies between 0 and 1. If the variable $FracEL_i$ were
included as a third regressor in addition to $STR_i$ and $PctEL_i$, the regressors would be
perfectly multicollinear. The reason is that PctEL is the percentage of English learners,
so that $PctEL_i = 100 \times FracEL_i$ for every district. Thus one of the regressors
($PctEL_i$) can be written as a perfect linear function of another regressor ($FracEL_i$).

Because of this perfect multicollinearity, it is impossible to compute the OLS
estimates of the regression of $TestScore_i$ on $STR_i$, $PctEL_i$, and $FracEL_i$. At an
intuitive level, OLS fails because you are asking, What is the effect of a unit change in the
percentage of English learners, holding constant the fraction of English learners?
Because the percentage of English learners and the fraction of English learners move
together in a perfect linear relationship, this question makes no sense, and OLS
cannot answer it.


Example 2: “Not very small” classes. Let $NVS_i$ be a binary variable that equals 1 if
the student-teacher ratio in the $i$th district is “not very small”; specifically, $NVS_i$
equals 1 if $STR_i \geq 12$ and equals 0 otherwise. Adding $NVS_i$ as a third regressor also
produces perfect multicollinearity, but for a more subtle reason than in the previous
example. There are, in fact, no districts in our data set with $STR_i < 12$; as you can see
in the scatterplot in Figure 4.2, the smallest value of STR is 14. Thus $NVS_i = 1$ for all
observations. Now recall that the linear regression model with an intercept can
equivalently be thought of as including a regressor, $X_{0i}$, that equals 1 for all $i$, as
shown in Equation (6.6). Thus we can write $NVS_i = 1 \times X_{0i}$ for all the observations
in our data set; that is, $NVS_i$ can be written as a perfect linear combination of the
regressors; specifically, it equals $X_{0i}$.

This illustrates two important points about perfect multicollinearity. First, when
the regression includes an intercept, then one of the regressors that can be implicated
in perfect multicollinearity is the constant regressor $X_{0i}$. Second, perfect
multicollinearity is a statement about the data set you have on hand. While it is possible to
imagine a school district with fewer than 12 students per teacher, there are no such
districts in our data set, so we cannot analyze them in our regression.


Example 3: Percentage of English speakers. Let $PctES_i$ be the percentage of English
speakers in the $i$th district, defined to be the percentage of students who are not
English learners. Again the regressors will be perfectly multicollinear. As in the
previous example, the perfect linear relationship among the regressors involves the
constant regressor $X_{0i}$: For every district, $PctES_i = 100 - PctEL_i = 100 \times X_{0i} - PctEL_i$
because $X_{0i} = 1$ for all $i$.

This example illustrates another point: Perfect multicollinearity is a feature of
the entire set of regressors. If either the intercept (that is, the regressor $X_{0i}$) or $PctEL_i$
were excluded from this regression, the regressors would not be perfectly
multicollinear.
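The three examples can be checked by testing whether the regressor matrix, including the constant regressor, has linearly dependent columns. The sketch below uses synthetic stand-ins for STR and PctEL (the value ranges are assumptions, apart from the minimum STR of 14 mentioned above), and the helper `has_perfect_multicollinearity` is a hypothetical name.

```python
import numpy as np


def has_perfect_multicollinearity(*regressors):
    """True if the constant regressor plus the given regressors are linearly dependent."""
    X = np.column_stack([np.ones(len(regressors[0])), *regressors])
    return bool(np.linalg.matrix_rank(X) < X.shape[1])


rng = np.random.default_rng(3)             # synthetic stand-in for the district data
n = 200
STR = rng.uniform(14, 26, size=n)          # no district has STR below 12
PctEL = rng.uniform(0, 80, size=n)

FracEL = PctEL / 100                       # Example 1
NVS = (STR >= 12).astype(float)            # Example 2: equals 1 for every observation
PctES = 100 - PctEL                        # Example 3

baseline = has_perfect_multicollinearity(STR, PctEL)
example1 = has_perfect_multicollinearity(STR, PctEL, FracEL)
example2 = has_perfect_multicollinearity(STR, PctEL, NVS)
example3 = has_perfect_multicollinearity(STR, PctEL, PctES)
example3_fixed = has_perfect_multicollinearity(STR, PctES)   # PctEL excluded

print(baseline, example1, example2, example3, example3_fixed)
```

The last check mirrors the point made about Example 3: once PctEL is dropped, the remaining regressors (constant, STR, and PctES) are no longer perfectly multicollinear.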


The dummy variable trap. Another possible source of perfect multicollinearity
arises when multiple binary, or dummy, variables are used as regressors. For example,
suppose you have partitioned the school districts into three categories: rural,
suburban, and urban. Each district falls into one (and only one) category. Let these
binary variables be $Rural_i$, which equals 1 for a rural district and equals 0 otherwise;
$Suburban_i$; and $Urban_i$. If you include all three binary variables in the regression
along with a constant, the regressors will be perfectly multicollinear: Because each
district belongs to one and only one category, $Rural_i + Suburban_i + Urban_i = 1 = X_{0i}$,
where $X_{0i}$ denotes the constant regressor introduced in Equation (6.6). Thus, to
estimate the regression, you must exclude one of these four variables, either one of the
binary indicators or the constant term. By convention, the constant term is typically
retained, in which case one of the binary indicators is excluded. For example, if $Rural_i$
were excluded, then the coefficient on $Suburban_i$ would be the average difference
between test scores in suburban and rural districts, holding constant the other
variables in the regression.

In general, if there are G binary variables, if each observation falls into one and
only one category, if there is an intercept in the regression, and if all G binary
variables are included as regressors, then the regression will fail because of perfect
multicollinearity. This situation is called the dummy variable trap. The usual way to avoid
the dummy variable trap is to exclude one of the binary variables from the multiple
regression, so only $G - 1$ of the G binary variables are included as regressors. In this
case, the coefficients on the included binary variables represent the incremental
effect of being in that category, relative to the base case of the omitted category,
holding constant the other regressors. Alternatively, all G binary regressors can be
included if the intercept is omitted from the regression.
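The dummy variable trap and its two remedies can be sketched with three synthetic categories. Everything here is illustrative: the category assignment, the outcome equation, and its coefficients are made up and are not estimates from the test score data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 90
category = np.arange(n) % 3            # 0 = rural, 1 = suburban, 2 = urban (synthetic)

# One column per category: Rural, Suburban, Urban.
dummies = (category[:, None] == np.arange(3)).astype(float)
const = np.ones((n, 1))

trap = np.column_stack([const, dummies])              # constant + all G = 3 dummies
drop_one = np.column_stack([const, dummies[:, 1:]])   # constant + G - 1 dummies (Rural omitted)
no_const = dummies                                    # all G dummies, no intercept

ranks = {}
for name, X in [("trap", trap), ("drop one", drop_one), ("no intercept", no_const)]:
    ranks[name] = (X.shape[1], int(np.linalg.matrix_rank(X)))
    print(name, "columns:", ranks[name][0], "rank:", ranks[name][1])

# With Rural omitted, the coefficient on Suburban is the difference in mean
# outcomes between suburban and rural observations (no other regressors here).
y = 650 + 5 * dummies[:, 1] + 10 * dummies[:, 2] + rng.normal(0, 3, size=n)
b = np.linalg.lstsq(drop_one, y, rcond=None)[0]
diff_means = y[category == 1].mean() - y[category == 0].mean()
print(b[1], diff_means)
```

Only the first design has fewer independent columns than regressors; dropping one dummy or dropping the intercept restores full rank.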


Solutions to perfect multicollinearity. Perfect multicollinearity typically arises when 
a mistake has been made in specifying the regression. Sometimes the mistake is easy 
to spot (as in the first example), but sometimes it is not (as in the second example). 
In one way or another, your software will let you know if you make such a mistake 
because it cannot compute the OLS estimator if you have. 

When your software lets you know that you have perfect multicollinearity, it is 
important that you modify your regression to eliminate it. You should understand the 
source of the multicollinearity. Some software is unreliable when there is perfect 
multicollinearity, and at a minimum, you will be ceding control over your choice of 
regressors to your computer if your regressors are perfectly multicollinear. 


Imperfect Multicollinearity 


Despite its similar name, imperfect multicollinearity is conceptually quite different 
from perfect multicollinearity. Imperfect multicollinearity means that two or more 
of the regressors are highly correlated in the sense that there is a linear function of 
the regressors that is highly correlated with another regressor. Imperfect multicol- 
linearity does not pose any problems for the theory of the OLS estimators; on the 
contrary, one use of OLS is to sort out the independent influences of the various 
regressors when the regressors are correlated.

If the regressors are imperfectly multicollinear, then the coefficient on at least
one individual regressor will be imprecisely estimated. For example, consider the
regression of TestScore on STR and PctEL. Suppose we were to add a third regressor,
the percentage of the district’s residents who are first-generation immigrants.
First-generation immigrants often speak English as a second language, so the variables
PctEL and percentage immigrants will be highly correlated: Districts with many
recent immigrants will tend to have many students who are still learning English.
Because these two variables are highly correlated, it would be difficult to use these
data to estimate the coefficient on PctEL, holding constant the percentage of
immigrants. In other words, the data set provides little information about what happens to
test scores when the percentage of English learners is low but the percentage of
immigrants is high, or vice versa. As a result, the OLS estimator of the coefficient on
PctEL in this regression will have a larger variance than if the regressors PctEL and
percentage immigrants were uncorrelated.

The effect of imperfect multicollinearity on the variance of the OLS estimators
can be seen mathematically by inspecting Equation (6.20) in Appendix 6.2, which is
the variance of $\hat{\beta}_1$ in a multiple regression with two regressors ($X_1$ and $X_2$) for the
special case of a homoskedastic error. In this case, the variance of $\hat{\beta}_1$ is inversely
proportional to $1 - \rho^2_{X_1, X_2}$, where $\rho_{X_1, X_2}$ is the correlation between $X_1$ and $X_2$. The
larger the correlation between the two regressors, the closer this term is to 0, and the
larger is the variance of $\hat{\beta}_1$. More generally, when multiple regressors are imperfectly
multicollinear, the coefficients on one or more of these regressors will be imprecisely
estimated; that is, they will have a large sampling variance.
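The inverse proportionality to $1 - \rho^2_{X_1, X_2}$ has an exact in-sample counterpart that is easy to verify. The sketch below assumes homoskedastic errors and simulated regressors; the function name `slope_variance_factor` is hypothetical. It compares the relevant diagonal element of $(X'X)^{-1}$ with $1 / [\sum_i (X_{1i} - \bar{X}_1)^2 (1 - r^2)]$, where $r$ is the sample correlation between the two regressors.

```python
import numpy as np


def slope_variance_factor(x1, x2):
    """Diagonal element of (X'X)^{-1} for the coefficient on x1, with X = [1, x1, x2].

    Under homoskedasticity, the variance of the OLS coefficient on x1,
    conditional on the regressors, is the error variance times this value.
    """
    X = np.column_stack([np.ones(len(x1)), x1, x2])
    return float(np.linalg.inv(X.T @ X)[1, 1])


rng = np.random.default_rng(5)
n = 1000
x1 = rng.normal(size=n)
z = rng.normal(size=n)

factors, closed_forms = {}, {}
for rho in (0.0, 0.5, 0.9, 0.99):
    x2 = rho * x1 + np.sqrt(1 - rho**2) * z        # population corr(x1, x2) = rho
    r = np.corrcoef(x1, x2)[0, 1]                  # sample correlation
    closed_forms[rho] = 1.0 / (((x1 - x1.mean()) ** 2).sum() * (1 - r**2))
    factors[rho] = slope_variance_factor(x1, x2)
    print(rho, factors[rho], closed_forms[rho])
```

The two columns of output agree, and the factor grows rapidly as the correlation approaches 1, which is why highly correlated regressors lead to imprecise coefficient estimates.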

Perfect multicollinearity is a problem that often signals the presence of a logical 
error. In contrast, imperfect multicollinearity is not necessarily an error but rather 
just a feature of OLS, your data, and the question you are trying to answer. If the 
variables in your regression are the ones you meant to include (the ones you chose
to address the potential for omitted variable bias), then imperfect multicollinearity
implies that it will be difficult to estimate precisely one or more of the partial effects 
using the data at hand. 


6.8 Control Variables and Conditional
Mean Independence


In the test score example, we included the percentage of English learners in the 
regression to address omitted variable bias in the estimate of the effect of class size. 
Specifically, by including percent English learners in the regression, we were able to 
estimate the effect of class size, controlling for the percent English learners. 

In this section, we make explicit the distinction between a regressor for which we
wish to estimate a causal effect (that is, a variable of interest) and control variables.
A control variable is not the object of interest in the study; rather, it is a regressor
included to hold constant factors that, if neglected, could lead the estimated causal
effect of interest to suffer from omitted variable bias. This distinction leads to a
modification of the first least squares assumption in Key Concept 6.4, in which some of
the variables are control variables. If this alternative assumption holds, the OLS
estimator of the effect of interest is unbiased, but the OLS coefficients on control
variables are, in general, biased and do not have a causal interpretation.

For example, consider the potential omitted variable bias arising from omitting 
outside learning opportunities from a test score regression. Although “outside learn- 
ing opportunities” is a broad concept that is difficult to measure, those opportunities 
are correlated with the students’ economic background, which can be measured. Thus 
a measure of economic background can be included in a test score regression to 
control for omitted income-related determinants of test scores, like outside learning 
opportunities. To this end, we augment the regression of test scores on STR and 
PctEL with the percentage of students receiving a free or subsidized school lunch 
(LchPct). Students are eligible for this program if their family income is less than a certain 
threshold (approximately 150% of the poverty line), so LchPct measures the fraction of 
economically disadvantaged children in the district. The estimated regression is 


$$\widehat{TestScore} = 700.2 - 1.00 \times STR - 0.122 \times PctEL - 0.547 \times LchPct. \qquad (6.16)$$


In this regression, the coefficient on the student-teacher ratio is the effect of the 
student-teacher ratio on test scores, controlling for the percentage of English learn- 
ers and the percentage eligible for a reduced-price lunch. Including the control 
variable LchPct does not substantially change any conclusions about the class size 
effect: The coefficient on STR changes only slightly from its value of $-1.10$ in
Equation (6.12) to $-1.00$ in Equation (6.16).

What does one make of the coefficient on LchPct in Equation (6.16)? That
coefficient is very large: The difference in test scores between a district with LchPct = 0%
and one with LchPct = 50% is estimated to be 27.4 points [$= 0.547 \times (50 - 0)$],
approximately the difference between the 75th and 25th percentiles of test scores in
Table 4.1. Does this coefficient have a causal interpretation? Suppose that upon
seeing Equation (6.16) the superintendent proposed eliminating the reduced-price
lunch program so that, for her district, LchPct would immediately drop to 0. Would 
eliminating the lunch program boost her district’s test scores? Common sense sug- 
gests that the answer is no; in fact, by leaving some students hungry, eliminating the 
reduced-price lunch program might well have the opposite effect. But does it make 
sense to treat as causal the coefficient on the variable of interest STR but not the 
coefficient on the control variable LchPct? 


Control Variables and Conditional Mean 
Independence 


To distinguish between variables of interest and control variables, we modify the
notation of the linear regression model to include k variables of interest, denoted by
X, and r control variables, denoted by W. Accordingly, the multiple regression model
with control variables is

$$Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \beta_{k+1} W_{1i} + \cdots + \beta_{k+r} W_{ri} + u_i, \quad i = 1, \ldots, n. \qquad (6.18)$$

The coefficients on the X’s, $\beta_1, \ldots, \beta_k$, are causal effects of interest.

The reason for including control variables in multiple regression is to make the
variables of interest no longer correlated with the error term, once the control
variables are held constant. This idea is made precise by replacing assumption 1 in Key
Concept 6.4 with an assumption called conditional mean independence. Conditional
mean independence requires that the conditional expectation of $u_i$ given the variable
of interest and the control variables does not depend on (is independent of) the
variable of interest, although it can depend on control variables.

The least squares assumptions for causal inference with control variables are
summarized in Key Concept 6.6. The first of these assumptions is a mathematical
statement of the conditional mean independence requirement. The remaining three
assumptions are extensions of their counterparts in Key Concept 6.4.

Key Concept 6.6: The Least Squares Assumptions for Causal Inference
in the Multiple Regression Model with Control Variables

$$Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \beta_{k+1} W_{1i} + \cdots + \beta_{k+r} W_{ri} + u_i, \quad i = 1, \ldots, n,$$

where $\beta_1, \ldots, \beta_k$ are causal effects; the W’s are control variables; and

1. $u_i$ has a conditional mean that does not depend on the X’s given the W’s; that is,

   $$E(u_i \mid X_{1i}, \ldots, X_{ki}, W_{1i}, \ldots, W_{ri}) = E(u_i \mid W_{1i}, \ldots, W_{ri})$$

   (conditional mean independence). (6.17)

2. $(X_{1i}, \ldots, X_{ki}, W_{1i}, \ldots, W_{ri}, Y_i)$, $i = 1, \ldots, n$, are independently and identically
   distributed (i.i.d.) draws from their joint distribution.

3. Large outliers are unlikely: $X_{1i}, \ldots, X_{ki}$, $W_{1i}, \ldots, W_{ri}$, and $Y_i$ have nonzero
   finite fourth moments.

4. There is no perfect multicollinearity.

The idea of conditional mean independence is that once you control for the W’s,
the X’s can be treated as if they were randomly assigned, in the sense that the
conditional mean of the error term no longer depends on X. Controlling for W makes the
X’s uncorrelated with the error term, so that OLS can estimate the causal effects on
Y of a change in each of the X’s. The control variables, however, remain correlated
with the error term, so the coefficients on the control variables are subject to omitted
variable bias and do not have a causal interpretation. The mathematics of this
interpretation is laid out in Appendix 6.5, where it is shown that if conditional mean
independence holds, then the OLS estimators of the coefficients on the X’s are
unbiased estimators of the causal effects of the X’s, but the OLS estimators of the
coefficients on the W’s are in general biased. This bias does not pose a problem because
we are interested in the coefficients on the X’s, not on the W’s.
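A simulation makes the distinction between the two kinds of coefficients concrete. The data-generating process below is entirely hypothetical: A is an unobserved factor that enters the error term, W is an observed control correlated with A, and X depends on W but is otherwise unrelated to A, so conditional mean independence holds by construction. The assumed causal effects are $-1$ for X and 0 for W.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

A = rng.normal(size=n)                 # unobserved factor (enters the error term)
W = A + rng.normal(size=n)             # observed control variable, correlated with A
X = 0.8 * W + rng.normal(size=n)       # variable of interest: as-if random given W
u = 2.0 * A + rng.normal(size=n)       # error term; here E(u | X, W) = E(u | W) = W

beta_x, beta_w = -1.0, 0.0             # hypothetical causal effects of X and W
Y = 5.0 + beta_x * X + beta_w * W + u


def ols(y, *cols):
    """OLS coefficients from regressing y on a constant and the given columns."""
    Z = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(Z, y, rcond=None)[0]


b_short = ols(Y, X)        # W omitted: omitted variable bias in the X coefficient
b_long = ols(Y, X, W)      # W included as a control variable

print("X coefficient, W omitted: ", round(float(b_short[1]), 3))
print("X coefficient, W included:", round(float(b_long[1]), 3))
print("W coefficient:            ", round(float(b_long[2]), 3))
```

With W included, the coefficient on X is close to its assumed causal effect of $-1$. The coefficient on W is close to 1 rather than its assumed causal effect of 0, because it also picks up the part of the error term that is predictable from W.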

In the class size example, LchPct can be correlated with factors, such as learn- 
ing opportunities outside school, that enter the error term; indeed, it is because of 
this correlation that LchPct is a useful control variable. This correlation between 
LchPct and the error term means that the estimated coefficient on LchPct does 
not have a causal interpretation. What the conditional mean independence 
assumption requires is that, given the control variables in the regression (PctEL 
and LchPct), the mean of the error term does not depend on the student-teacher 
ratio. Said differently, conditional mean independence says that among schools 
with the same values of PctEL and LchPct, class size is “as-if” randomly assigned: 
Including PctEL and LchPct in the regression controls for omitted factors so that 
STR is uncorrelated with the error term. If so, the coefficient on the student- 
teacher ratio has a causal interpretation even though the coefficient on LchPct 
does not. 

The first least squares assumption for multiple regression with control variables 
makes precise the requirement needed to eliminate the omitted variable bias with which 
this chapter began: Given, or holding constant, the values of the control variables, the 
variable of interest is as-if randomly assigned in the sense that the mean of the error 
term no longer depends on X given the control variables. This requirement serves as a 
useful guide for choosing control variables and for judging their adequacy.


6.9 Conclusion


Regression with a single regressor is vulnerable to omitted variable bias: If an omitted 
variable is a determinant of the dependent variable and is correlated with the regres- 
sor, then the OLS estimator of the causal effect will be biased and will reflect both 
the effect of the regressor and the effect of the omitted variable. Multiple regression 
makes it possible to mitigate or eliminate omitted variable bias by including the
omitted variable in the regression. The coefficient on a regressor, $X_1$, in multiple
regression is the partial effect of a change in $X_1$, holding constant the other included
regressors. In the test score example, including the percentage of English learners as 
a regressor made it possible to estimate the effect on test scores of a change in 
the student-teacher ratio, holding constant the percentage of English learners. Doing 
so reduced by half the estimated effect on test scores of a change in the student- 
teacher ratio. 

The statistical theory of multiple regression builds on the statistical theory of
regression with a single regressor. The least squares assumptions for multiple
regression are extensions of the three least squares assumptions for regression with a single
regressor, plus a fourth assumption ruling out perfect multicollinearity. Because the
regression coefficients are estimated using a single sample, the OLS estimators have
a joint sampling distribution and therefore have sampling uncertainty. This sampling
uncertainty must be quantified as part of an empirical study, and the ways to do so
in the multiple regression model are the topic of the next chapter.


Summary 


1. Omitted variable bias occurs when an omitted variable (a) is correlated with
an included regressor and (b) is a determinant of Y.

2. The multiple regression model is a linear regression model that includes
multiple regressors, $X_1, X_2, \ldots, X_k$. Associated with each regressor is a
regression coefficient, $\beta_1, \beta_2, \ldots, \beta_k$. The coefficient $\beta_1$ is the expected difference
in Y associated with a one-unit difference in $X_1$, holding the other regressors
constant. The other regression coefficients have an analogous interpretation.

3. The coefficients in multiple regression can be estimated by OLS. When the four 
least squares assumptions in Key Concept 6.4 are satisfied, the OLS estimators 
of the causal effect are unbiased, consistent, and normally distributed in large 
samples. 

4. The role of control variables is to hold constant omitted factors so that the 
variable of interest is no longer correlated with the error term. Properly chosen 
control variables can eliminate omitted variable bias in the OLS estimate of 
the causal effect of interest. 

5. Perfect multicollinearity, which occurs when one regressor is an exact linear 
function of the other regressors, usually arises from a mistake in choosing 
which regressors to include in a multiple regression. Solving perfect multicol- 
linearity requires changing the set of regressors. 

6. The standard error of the regression, the $R^2$, and the $\bar{R}^2$ are measures of fit for
the multiple regression model.
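
The measures of fit named in the last summary point can be computed directly from the observed values and the fitted values of a regression. The sketch below is illustrative only: the function name `fit_measures` and the five data points are hypothetical, and the fitted values are simply taken as given. It uses the usual definitions $SER = \sqrt{SSR/(n-k-1)}$, $R^2 = 1 - SSR/TSS$, and $\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\,\frac{SSR}{TSS}$.

```python
import math


def fit_measures(y, y_hat, k):
    """Return (SER, R2, adjusted R2) for a regression with k regressors plus an intercept."""
    n = len(y)
    y_bar = sum(y) / n
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # sum of squared residuals
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    ser = math.sqrt(ssr / (n - k - 1))                     # degrees-of-freedom correction
    r2 = 1 - ssr / tss
    adj_r2 = 1 - (n - 1) / (n - k - 1) * ssr / tss
    return ser, r2, adj_r2


# Hypothetical data: five observations and fitted values from a one-regressor model.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.2, 1.9, 3.1, 3.8, 5.0]
ser, r2, adj_r2 = fit_measures(y, y_hat, k=1)
print(f"SER = {ser:.4f}, R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}")
```

Because the factor $(n-1)/(n-k-1)$ exceeds 1 whenever $k \geq 1$, the adjusted measure is always below $R^2$, and it can become negative when the regressors explain very little of the variation in $Y$.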

Key Terms 

omitted variable bias (212)
multiple regression model (217)
population regression line (218)
population regression function (218)
intercept (218)
slope coefficient of $X_{1i}$ (218)
coefficient on $X_{1i}$ (218)
slope coefficient of $X_{2i}$ (218)
coefficient on $X_{2i}$ (218)
holding $X_2$ constant (218)
controlling for $X_2$ (218)
partial effect (219)
population multiple regression model (219)
constant regressor (219)
constant term (219)
homoskedastic (219)
heteroskedastic (219)
ordinary least squares (OLS) estimators of $\beta_0, \beta_1, \ldots, \beta_k$ (220)
OLS regression line (220)
predicted value (220)
OLS residual (220)
$R^2$ (223)
adjusted $R^2$ ($\bar{R}^2$) (223)
perfect multicollinearity (226)
dummy variable trap (230)
imperfect multicollinearity (230)
control variable (231)
multiple regression model with control variables (233)
conditional mean independence (233)

Review the Concepts


6.1 A researcher is estimating the effect of studying on the test scores of students
from a private school. She is concerned, however, that she does not have information
on the class size to include in the regression. What effect would the
omission of the class size variable have on her estimated coefficient on the
private school indicator variable? Will the effect of this omission disappear if
she uses a larger sample of students?

6.2 A multiple regression includes two regressors: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$.
What is the expected change in $Y$ if $X_1$ increases by 8 units and $X_2$ is
unchanged? What is the expected change in $Y$ if $X_2$ decreases by 3 units and
$X_1$ is unchanged? What is the expected change in $Y$ if $X_1$ increases by 4 units
and $X_2$ decreases by 7 units?

6.3 What are the measures of fit commonly used for multiple regressions? How
can an adjusted $R^2$ take on negative values?

6.4 What is a dummy variable trap? Explain how it is related to multicollinearity
of regressors. What is the solution for this form of multicollinearity?

6.5 How is imperfect collinearity of regressors different from perfect collinearity?
Compare the solutions for these two concerns with multiple regression
estimation.

Exercises


The first four exercises refer to the table of estimated regressions on page 238,
computed using data for 2015 from the Current Population Survey. The data set
consists of information on 7178 full-time, full-year workers. The highest educational
achievement for each worker was either a high school diploma or a bachelor’s degree.
The workers’ ages ranged from 25 to 34 years. The data set also contains information
on the region of the country where the person lived, marital status, and number of
children. For the purposes of these exercises, let

AHE = average hourly earnings
College = binary variable (1 if college, 0 if high school)
Female = binary variable (1 if female, 0 if male)
Age = age (in years)
Northeast = binary variable (1 if Region = Northeast, 0 otherwise)
Midwest = binary variable (1 if Region = Midwest, 0 otherwise)
South = binary variable (1 if Region = South, 0 otherwise)
West = binary variable (1 if Region = West, 0 otherwise)


6.1 Compute $\bar{R}^2$ for each of the regressions.

6.2 Using the regression results in column (1):

a. Do workers with college degrees earn more, on average, than workers
with only high school diplomas? How much more?

b. Do men earn more than women, on average? How much more?

6.3 Using the regression results in column (2):

a. Is age an important determinant of earnings? Explain.

b. Sally is a 29-year-old female college graduate. Betsy is a 34-year-old
female college graduate. Predict Sally’s and Betsy’s earnings.

6.4 Using the regression results in column (3):

a. Do there appear to be important regional differences?

b. Why is the regressor West omitted from the regression? What would
happen if it were included?

c. Juanita is a 28-year-old female college graduate from the South. Jennifer
is a 28-year-old female college graduate from the Midwest. Calculate the
expected difference in earnings between Juanita and Jennifer.

Results of Regressions of Average Hourly Earnings on Sex and Education
Binary Variables and Other Characteristics, Using 2015 Data from the
Current Population Survey

Dependent variable: average hourly earnings (AHE).

Regressor            (1)       (2)       (3)
College (X1)        10.47     10.44     10.42
Female (X2)         -4.69     -4.56     -4.57
Age (X3)                       0.61      0.61
Northeast (X4)                           0.74
Midwest (X5)                            -1.54
South (X6)                              -0.44
Intercept           18.15      0.11      0.33

Summary Statistics
SER                 12.15     12.03     12.01
$R^2$               0.165     0.182     0.185
$\bar{R}^2$
n                    7178      7178      7178

6.5 Data were collected from a random sample of 200 home sales from a community
in 2013. Let Price denote the selling price (in \$1000s), BDR denote
the number of bedrooms, Bath denote the number of bathrooms, Hsize denote
the size of the house (in square feet), Lsize denote the lot size (in square feet),
Age denote the age of the house (in years), and Poor denote a binary variable
that is equal to 1 if the condition of the house is reported as “poor.” An
estimated regression yields

$$
\widehat{Price} = 109.7 + 0.567\,BDR + 26.9\,Bath + 0.239\,Hsize + 0.005\,Lsize + 0.1\,Age - 56.9\,Poor, \quad \bar{R}^2 = 0.85, \; SER = 45.8.
$$

a. Suppose that a homeowner converts part of an existing family room in
her house into a new bathroom. What is the expected increase in the
value of the house?

b. Suppose that a homeowner adds a new bathroom to her house, which
increases the size of the house by 80 square feet. What is the expected
increase in the value of the house?

c. What is the loss in value if a homeowner lets his house run down so that
its condition becomes “poor”?

d. Compute the $R^2$ for the regression.


6.6 A researcher plans to study the causal effect of a strong legal system on the
number of scandals in a country, using data from a random sample of countries
in Asia. The researcher plans to regress the number of scandals on how
strong a legal system is in the countries (an indicator variable taking the value
1 or 0, based on expert opinion).

a. Do you think this regression suffers from omitted variable bias? Explain
why. Which variables would you add to the regression?

b. Using the expression for omitted variable bias given in Equation (6.1),
assess whether the regression will likely over- or underestimate the
effect of a strong legal system on the number of scandals in a country.
That is, do you think that $\hat{\beta}_1 > \beta_1$ or $\hat{\beta}_1 < \beta_1$?

6.7 Critique each of the following proposed research plans. Your critique should
explain any problems with the proposed research and describe how the research
plan might be improved. Include a discussion of any additional data that need to
be collected and the appropriate statistical techniques for analyzing those data.

a. A researcher wants to determine whether a leading global university
is guilty of racial bias in admissions. To determine potential bias, the
researcher collects data on the race of all applicants to the university
for a given year. The researcher plans to conduct a difference-in-means
test to determine whether the proportion of acceptances among Black
candidates is systematically different from the proportion of acceptances
among other candidates.

b. A researcher is interested in identifying the impact of a mother’s
education on the educational attainment of her child. She collects data
on a random sample of individuals aged between 25 and 40 years who
are out of the schooling system. The data set contains information on
each person’s level of schooling, the type of school attended, gender and
ethnicity, as well as information on the schooling of their parents and the
demographic characteristics of the household in which they grew up. The
researcher plans to regress years of schooling achieved by an individual
on the years of schooling of their mother, including in the regression
the other potential determinants of schooling (number of siblings and
whether parents lived together or are separated) as controls.

6.8 A government study found that people who eat chocolate frequently weigh
less than people who don’t. Researchers questioned 1000 individuals from
Cairo between the ages of 20 and 85 about their eating habits, and measured
their weight and height. On average, participants ate chocolate twice a week
and had a body mass index (BMI) of 28. There was an observed difference of
five to seven pounds in weight between those who ate chocolate five times a
week and those who did not eat any chocolate at all, with the chocolate eaters
weighing less on average. Frequent chocolate eaters also consumed more
calories, on average, than people who consumed less chocolate. Based on this
summary, would you recommend that Egyptians who do not presently eat
chocolate should consider eating chocolate up to five times a week if they
want to lose weight? Why or why not? Explain.

6.9 $(Y_i, X_{1i}, X_{2i})$ satisfy the assumptions in Key Concept 6.4. You are interested in
$\beta_1$, the causal effect of $X_1$ on $Y$. Suppose $X_1$ and $X_2$ are uncorrelated. You
estimate $\beta_1$ by regressing $Y$ onto $X_1$ (so that $X_2$ is not included in the regression).
Does this estimator suffer from omitted variable bias? Explain.
6.10 $(Y_i, X_{1i}, X_{2i})$ satisfy the assumptions in Key Concept 6.4; in addition,
$\mathrm{var}(u_i \mid X_{1i}, X_{2i}) = 4$ and $\mathrm{var}(X_{1i}) = 6$. A random sample of size $n = 400$ is
drawn from the population.

a. Assume that $X_1$ and $X_2$ are uncorrelated. Compute the variance of $\hat{\beta}_1$.
[Hint: Look at Equation (6.20) in Appendix 6.2.]

b. Assume that $\mathrm{corr}(X_1, X_2) = 0.5$. Compute the variance of $\hat{\beta}_1$.

c. Comment on the following statements: “When $X_1$ and $X_2$ are correlated,
the variance of $\hat{\beta}_1$ is larger than it would be if $X_1$ and $X_2$ were uncorrelated.
Thus, if you are interested in $\beta_1$, it is best to leave $X_2$ out of the
regression if it is correlated with $X_1$.”

6.11 (Requires calculus) Consider the regression model

$$
Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + u_i
$$

for $i = 1, \ldots, n$. (Notice that there is no constant term in the regression.)
Following analysis like that used in Appendix 4.2:

a. Specify the least squares function that is minimized by OLS.

b. Compute the partial derivatives of the objective function with respect to
$b_1$ and $b_2$.

c. Suppose that $\sum_{i=1}^{n} X_{1i} X_{2i} = 0$. Show that $\hat{\beta}_1 = \sum_{i=1}^{n} X_{1i} Y_i \big/ \sum_{i=1}^{n} X_{1i}^2$.

d. Suppose that $\sum_{i=1}^{n} X_{1i} X_{2i} \neq 0$. Derive an expression for $\hat{\beta}_1$ as a function
of the data $(Y_i, X_{1i}, X_{2i})$, $i = 1, \ldots, n$.

e. Suppose that the model includes an intercept: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$.
Show that the least squares estimators satisfy $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}_1 - \hat{\beta}_2 \bar{X}_2$.

f. As in (e), suppose that the model contains an intercept. Also
suppose that $\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2) = 0$. Show that
$\hat{\beta}_1 = \sum_{i=1}^{n} (X_{1i} - \bar{X}_1)(Y_i - \bar{Y}) \big/ \sum_{i=1}^{n} (X_{1i} - \bar{X}_1)^2$. How does this
compare to the OLS estimator of $\beta_1$ from the regression that omits $X_2$?

6.12 A school district undertakes an experiment to estimate the effect of class size
on test scores in second-grade classes. The district assigns 50% of its previous
year’s first graders to small second-grade classes (18 students per classroom)
and 50% to regular-size classes (21 students per classroom). Students new
to the district are handled differently: 20% are randomly assigned to small
classes and 80% to regular-size classes. At the end of the second-grade school
year, each student is given a standardized exam. Let $Y_i$ denote the exam score
for the $i$th student, $X_i$ denote a binary variable that equals 1 if the student is
assigned to a small class, and $W_i$ denote a binary variable that equals 1 if the
student is newly enrolled. Let $\beta_1$ denote the causal effect on test scores of
reducing class size from regular to small.

a. Consider the regression $Y_i = \beta_0 + \beta_1 X_i + u_i$. Do you think that
$E(u_i \mid X_i) = 0$? Is the OLS estimator of $\beta_1$ unbiased and consistent?
Explain.

b. Consider the regression $Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i$. Do you think that
$E(u_i \mid X_i, W_i)$ depends on $X_i$? Is the OLS estimator of $\beta_1$ unbiased and
consistent? Explain. Do you think that $E(u_i \mid X_i, W_i)$ depends on $W_i$? Will
the OLS estimator of $\beta_2$ provide an unbiased and consistent estimate of
the causal effect of transferring to a new school (that is, being a newly
enrolled student)? Explain.


Empirical Exercises 


(Only two empirical exercises for this chapter are given in the text, but you can find 
more on the text website, http://www.pearsonglobaleditions.com. ) 


E6.1 Use the Birthweight_Smoking data set introduced in Empirical Exercise E5.3 
to answer the following questions. 


a. Regress Birthweight on Smoker. What is the estimated effect of smoking 
on birth weight? 


b. Regress Birthweight on Smoker, Alcohol, and Nprevist. 


i. Using the two conditions in Key Concept 6.1, explain why the 
exclusion of Alcohol and Nprevist could lead to omitted variable bias 
in the regression estimated in (a). 


ii. Is the estimated effect of smoking on birth weight substantially 
different from the regression that excludes Alcohol and Nprevist? 
Does the regression in (a) seem to suffer from omitted variable bias? 


iii. Jane smoked during her pregnancy, did not drink alcohol, and had 8 
prenatal care visits. Use the regression to predict the birth weight of 
Jane’s child. 


iv. Compute $R^2$ and $\bar{R}^2$. Why are they so similar?


v. How should you interpret the coefficient on Nprevist? Does the 
coefficient measure a causal effect of prenatal visits on birth weight? 
If not, what does it measure? 


c. Estimate the coefficient on Smoking for the multiple regression model 
in (b), using the three-step process in Appendix 6.3 (the Frisch-Waugh 
theorem). Verify that the three-step process yields the same estimated 
coefficient for Smoking as that obtained in (b). 

d. An alternative way to control for prenatal visits is to use the binary
variables Tripre0 through Tripre3. Regress Birthweight on Smoker,
Alcohol, Tripre0, Tripre2, and Tripre3.

i. Why is Tripre1 excluded from the regression? What would happen if
you included it in the regression?


ii. The estimated coefficient on Tripre0 is large and negative. What does 
this coefficient measure? Interpret its value. 


iii. Interpret the value of the estimated coefficients on Tripre2 and Tripre3. 


iv. Does the regression in (d) explain a larger fraction of the variance in 
birth weight than the regression in (b)? 


E6.2 Using the data set Growth described in Empirical Exercise E4.1, but exclud- 
ing the data for Malta, carry out the following exercises. 


a. Construct a table that shows the sample mean, standard deviation,
and minimum and maximum values for the series Growth, TradeShare,
YearsSchool, Oil, Rev_Coups, Assassinations, and RGDP60. Include the
appropriate units for all entries.

b. Run a regression of Growth on TradeShare, YearsSchool, Rev_Coups,
Assassinations, and RGDP60. What is the value of the coefficient on
Rev_Coups? Interpret the value of this coefficient. Is it large or small in
a real-world sense?


c. Use the regression to predict the average annual growth rate for a 
country that has average values for all regressors. 

d. Repeat (c), but now assume that the country’s value for TradeShare is 
one standard deviation above the mean. 


e. Why is Oil omitted from the regression? What would happen if it were 
included? 


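Appendix 6.1: Derivation of Equation (6.1)

This appendix presents a derivation of the formula for omitted variable bias in Equation (6.1).
Equation (4.28) in Appendix 4.3 states

$$
\hat{\beta}_1 = \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})\,u_i}{\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2}. \tag{6.19}
$$

Under the last two assumptions in Key Concept 4.3, $(1/n)\sum_{i=1}^{n}(X_i - \bar{X})^2 \xrightarrow{p} \sigma_X^2$ and
$(1/n)\sum_{i=1}^{n}(X_i - \bar{X})u_i \xrightarrow{p} \mathrm{cov}(u_i, X_i) = \rho_{Xu}\sigma_u\sigma_X$. Substitution of these limits into
Equation (6.19) yields Equation (6.1).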
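
Dividing the second limit by the first gives $\rho_{Xu}\sigma_u\sigma_X / \sigma_X^2 = \rho_{Xu}\sigma_u/\sigma_X$, so the OLS slope converges to $\beta_1 + \rho_{Xu}\sigma_u/\sigma_X$ rather than to $\beta_1$. A small simulation can make this concrete. The following is a minimal sketch under assumed, hypothetical values ($\beta_1 = 3$, $\sigma_X = 1$, $\sigma_u = 2$, $\rho_{Xu} = 0.5$, normally distributed draws); none of these numbers come from the text.

```python
import math
import random

random.seed(0)

beta0, beta1 = 1.0, 3.0                  # hypothetical population coefficients
sigma_x, sigma_u, rho = 1.0, 2.0, 0.5    # hypothetical sd of X, sd of u, corr(X, u)
n = 100_000

xs, ys = [], []
for _ in range(n):
    z = random.gauss(0.0, 1.0)
    e = random.gauss(0.0, 1.0)
    x = sigma_x * z
    # u has standard deviation sigma_u and correlation rho with X.
    u = sigma_u * (rho * z + math.sqrt(1.0 - rho ** 2) * e)
    xs.append(x)
    ys.append(beta0 + beta1 * x + u)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
# Single-regressor OLS slope: sample covariance divided by sample variance.
beta1_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))

limit = beta1 + rho * sigma_u / sigma_x  # probability limit implied by the derivation
print(f"OLS slope: {beta1_hat:.3f}; limit beta1 + rho*sigma_u/sigma_X: {limit:.3f}")
```

With these values the limit is 4, so the estimate settles near 4 even in a very large sample: omitted variable bias does not vanish as $n$ grows.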
Appendix 6.2: Distribution of the OLS Estimators
When There Are Two Regressors
and Homoskedastic Errors

Although the general formula for the variance of the OLS estimators in multiple regression is
complicated, if there are two regressors ($k = 2$) and the errors are homoskedastic, then the
formula simplifies enough to provide some insights into the distribution of the OLS
estimators.

Because the errors are homoskedastic, the conditional variance of $u_i$ can be written as
$\mathrm{var}(u_i \mid X_{1i}, X_{2i}) = \sigma_u^2$. When there are two regressors, $X_{1i}$ and $X_{2i}$, and the error term is
homoskedastic, in large samples the sampling distribution of $\hat{\beta}_1$ is $N(\beta_1, \sigma^2_{\hat{\beta}_1})$, where the
variance of this distribution, $\sigma^2_{\hat{\beta}_1}$, is

$$
\sigma^2_{\hat{\beta}_1} = \frac{1}{n}\left(\frac{1}{1 - \rho^2_{X_1, X_2}}\right)\frac{\sigma_u^2}{\sigma^2_{X_1}}, \tag{6.20}
$$

where $\rho_{X_1, X_2}$ is the population correlation between the two regressors $X_1$ and $X_2$ and $\sigma^2_{X_1}$ is the
population variance of $X_1$.

The variance $\sigma^2_{\hat{\beta}_1}$ of the sampling distribution of $\hat{\beta}_1$ depends on the squared correlation
between the regressors. If $X_1$ and $X_2$ are highly correlated, either positively or negatively, then
$\rho^2_{X_1, X_2}$ is close to 1, so the term $1 - \rho^2_{X_1, X_2}$ in the denominator of Equation (6.20) is small and
the variance of $\hat{\beta}_1$ is larger than it would be if $\rho_{X_1, X_2}$ were close to 0.

Another feature of the joint normal large-sample distribution of the OLS estimators is that
$\hat{\beta}_1$ and $\hat{\beta}_2$ are, in general, correlated. When the errors are homoskedastic, the correlation between
the OLS estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ is the negative of the correlation between the two regressors (see
Exercise 19.18):

$$
\mathrm{corr}(\hat{\beta}_1, \hat{\beta}_2) = -\rho_{X_1, X_2}. \tag{6.21}
$$
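
To see how strongly the correlation between the regressors inflates the variance in Equation (6.20), it can help to evaluate the formula at a few values of $\rho_{X_1,X_2}$. The sketch below does only that; the function name `var_beta1_hat` and the inputs ($n = 100$, $\sigma_u^2 = 9$, $\sigma_{X_1}^2 = 4$) are hypothetical choices made for illustration.

```python
def var_beta1_hat(n, rho, sigma_u2, sigma_x1_2):
    """Large-sample variance of the OLS estimator of beta_1, as in Equation (6.20)."""
    return (1.0 / n) * (1.0 / (1.0 - rho ** 2)) * (sigma_u2 / sigma_x1_2)


# Hypothetical values: n = 100, var(u) = 9, var(X1) = 4.
for rho in (0.0, 0.5, 0.9, -0.9):
    print(f"rho = {rho:+.1f}: variance = {var_beta1_hat(100, rho, 9.0, 4.0):.4f}")
```

Only the squared correlation enters, so a correlation of $-0.9$ inflates the variance exactly as much as a correlation of $+0.9$.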


Appendix 6.3: The Frisch-Waugh Theorem

The OLS estimator in multiple regression can be computed by a sequence of shorter
regressions. Consider the multiple regression model in Equation (6.7). The OLS estimator of
$\beta_1$ can be computed in three steps:

1. Regress $X_1$ on $X_2, X_3, \ldots, X_k$, and let $\tilde{X}_1$ denote the residuals from this regression;

2. Regress $Y$ on $X_2, X_3, \ldots, X_k$, and let $\tilde{Y}$ denote the residuals from this regression; and

3. Regress $\tilde{Y}$ on $\tilde{X}_1$,

where the regressions include a constant term (intercept). The Frisch-Waugh theorem states
that the OLS coefficient in step 3 equals the OLS coefficient on $X_1$ in the multiple regression
model [Equation (6.7)].

This result provides a mathematical statement of how the multiple regression coefficient
$\hat{\beta}_1$ estimates the effect on $Y$ of $X_1$, controlling for the other $X$’s: Because the first two
regressions (steps 1 and 2) remove from $Y$ and $X_1$ their variation associated with the other $X$’s, the
third regression estimates the effect on $Y$ of $X_1$ using what is left over after removing (controlling
for) the effect of the other $X$’s. The Frisch-Waugh theorem is proven in Exercise 19.17.
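
The three-step procedure can be checked numerically. The sketch below is an illustration, not the text's own code: it simulates hypothetical data with two regressors, uses a small helper `ols` (an assumed name) that solves the normal equations in plain Python, and compares the coefficient on $X_1$ from the full regression with the coefficient from step 3.

```python
import random


def ols(y, columns):
    """OLS coefficients [intercept, b1, ..., bk], found by solving the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[j] * r[m] for r in rows) for m in range(p)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda idx: abs(a[idx][c]))
        a[c], a[piv] = a[piv], a[c]
        for idx in range(p):
            if idx != c:
                f = a[idx][c] / a[c][c]
                a[idx] = [v - f * w for v, w in zip(a[idx], a[c])]
    return [a[j][p] / a[j][j] for j in range(p)]


def residuals(y, columns):
    """Residuals from the OLS regression of y on a constant and the given columns."""
    b = ols(y, columns)
    return [y[i] - b[0] - sum(bj * col[i] for bj, col in zip(b[1:], columns))
            for i in range(len(y))]


# Hypothetical data: X1 and X2 are correlated, and both affect Y.
random.seed(0)
n = 500
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.6 * v + random.gauss(0, 1) for v in x2]
y = [1.0 + 2.0 * a1 - 1.5 * a2 + random.gauss(0, 1) for a1, a2 in zip(x1, x2)]

beta_full = ols(y, [x1, x2])        # multiple regression of Y on X1 and X2
x1_tilde = residuals(x1, [x2])      # step 1
y_tilde = residuals(y, [x2])        # step 2
beta_fw = ols(y_tilde, [x1_tilde])  # step 3

print(f"coefficient on X1, full regression: {beta_full[1]:.6f}")
print(f"coefficient from step 3:            {beta_fw[1]:.6f}")
```

The two printed coefficients agree up to floating-point rounding, which is what the theorem asserts; with more than two regressors, steps 1 and 2 would simply include all of the other $X$'s.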

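This theorem suggests how Equation (6.20) can be derived from Equation (5.27). Because
$\hat{\beta}_1$ is the OLS regression coefficient from the regression of $\tilde{Y}$ onto $\tilde{X}_1$, Equation (5.27) suggests
that the homoskedasticity-only variance of $\hat{\beta}_1$ is

$$
\sigma^2_{\hat{\beta}_1} = \frac{\sigma_u^2}{n\,\sigma^2_{\tilde{X}_1}},
$$

where $\sigma^2_{\tilde{X}_1}$ is the variance of $\tilde{X}_1$. Because $\tilde{X}_1$ is the residual from the regression of $X_1$ onto $X_2$
(recall that Equation (6.20) pertains to the model with $k = 2$ regressors), Equation (6.15) implies that
$s^2_{\tilde{X}_1} = (1 - \bar{R}^2_{X_1, X_2})\,s^2_{X_1}$, where $\bar{R}^2_{X_1, X_2}$ is the adjusted $R^2$ from the regression of $X_1$ onto $X_2$.
Equation (6.20) follows from $s^2_{\tilde{X}_1} \xrightarrow{p} \sigma^2_{\tilde{X}_1}$, $\bar{R}^2_{X_1, X_2} \xrightarrow{p} \rho^2_{X_1, X_2}$, and $s^2_{X_1} \xrightarrow{p} \sigma^2_{X_1}$.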


Appendix 6.4: The Least Squares Assumptions for
Prediction with Multiple Regressors

This appendix extends the least squares assumptions for prediction with a single regressor in
Appendix 4.4 to multiple regressors. It then discusses the unbiasedness of the OLS estimator
of the population regression line and the unbiasedness of the forecasts.

Adopt the notation of the least squares assumptions for prediction with a single regressor
in Appendix 4.4, so that the out-of-sample (“oos”) observation is $(X_1^{oos}, \ldots, X_k^{oos}, Y^{oos})$. The
aim is to predict $Y^{oos}$ given $X_1^{oos}, \ldots, X_k^{oos}$. Let $(X_{1i}, \ldots, X_{ki}, Y_i)$, $i = 1, \ldots, n$, be the data
used to estimate the regression coefficients. The least squares assumptions for prediction with
multiple regressors are

$E(Y \mid X_1, \ldots, X_k) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$ and $u = Y - E(Y \mid X_1, \ldots, X_k)$, where

1. $(X_1^{oos}, \ldots, X_k^{oos}, Y^{oos})$ are randomly drawn from the same population distribution as
$(X_{1i}, \ldots, X_{ki}, Y_i)$, $i = 1, \ldots, n$.
2. $(X_{1i}, \ldots, X_{ki}, Y_i)$, $i = 1, \ldots, n$, are i.i.d. draws from their joint distribution.
3. Large outliers are unlikely: $X_{1i}, \ldots, X_{ki}$ and $Y_i$ have nonzero finite fourth moments.
4. There is no perfect multicollinearity.

As in the case of a single $X$ in Appendix 4.4, for prediction the $\beta$’s are defined to be the
coefficients of the population conditional expectation. These $\beta$’s may or may not have a causal
interpretation. Assumption 1 ensures that this conditional expectation, estimated using the
in-sample data, is the same as the conditional expectation that applies to the out-of-sample
prediction observation. The remaining assumptions are technical assumptions that play the
same role as they do for causal inference.

Under the definition that the $\beta$’s are the coefficients of the linear conditional expectation,
the error $u$ necessarily has a conditional mean of 0, so that $E(u_i \mid X_{1i}, \ldots, X_{ki}) = 0$. Thus the
calculations in Chapter 19 show that the OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are unbiased for the
respective population slope coefficients. Under the additional technical conditions of
assumptions 2 through 4, the OLS estimators are consistent for these conditional expectation slope
coefficients and are normally distributed in large samples.

The unbiasedness of the out-of-sample forecast follows from the unbiasedness of the OLS
estimators and the first prediction assumption, which ensures that the out-of-sample observation
and in-sample observations are independently drawn from the same distribution.
Specifically,

$$
\begin{aligned}
E(\hat{Y}^{oos} &\mid X_1^{oos} = x_1^{oos}, \ldots, X_k^{oos} = x_k^{oos}) \\
&= E(\hat{\beta}_0 + \hat{\beta}_1 X_1^{oos} + \cdots + \hat{\beta}_k X_k^{oos} \mid X_1^{oos} = x_1^{oos}, \ldots, X_k^{oos} = x_k^{oos}) \\
&= E(\hat{\beta}_0 \mid X_1^{oos} = x_1^{oos}, \ldots, X_k^{oos} = x_k^{oos}) + E(\hat{\beta}_1 X_1^{oos} \mid X_1^{oos} = x_1^{oos}, \ldots, X_k^{oos} = x_k^{oos}) \\
&\qquad + \cdots + E(\hat{\beta}_k X_k^{oos} \mid X_1^{oos} = x_1^{oos}, \ldots, X_k^{oos} = x_k^{oos}) \\
&= \beta_0 + \beta_1 x_1^{oos} + \cdots + \beta_k x_k^{oos} \\
&= E(Y^{oos} \mid X_1^{oos} = x_1^{oos}, \ldots, X_k^{oos} = x_k^{oos}),
\end{aligned}
\tag{6.22}
$$

where the third equality follows from the independence of the out-of-sample and in-sample
observations and from the unbiasedness of the OLS estimators for the population slope
coefficients of the in-sample conditional expectation, and where the final equality follows from the
in- and out-of-sample observations being drawn from the same distribution.


Appendix 6.5: Distribution of OLS Estimators in Multiple
Regression with Control Variables

This appendix shows that under least squares assumption 1 for multiple regression with
control variables [Equation (6.18)], the OLS coefficient estimator is unbiased for the causal effect
of the variables of interest. Moreover, with the addition of technical assumptions 2 through 4 in Key
Concept 6.6, the OLS estimator is a consistent estimator of the causal effect and has a normal
distribution in large samples. The OLS estimator of the coefficients on the control variables
estimates the slope coefficient in a conditional expectation and is normally distributed in large
samples around that slope coefficient; however, that slope coefficient does not, in general, have
a causal interpretation.

As we have throughout, assume that conditional expectations are linear, so that the
conditional mean independence assumption is

$$
E(u_i \mid X_{1i}, \ldots, X_{ki}, W_{1i}, \ldots, W_{ri}) = E(u_i \mid W_{1i}, \ldots, W_{ri}) = \gamma_0 + \gamma_1 W_{1i} + \cdots + \gamma_r W_{ri}, \tag{6.23}
$$

where the $\gamma$’s are coefficients. Then the conditional expectation of $Y_i$ is

$$
\begin{aligned}
E(Y_i &\mid X_{1i}, \ldots, X_{ki}, W_{1i}, \ldots, W_{ri}) \\
&= E(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \beta_{k+1} W_{1i} + \cdots + \beta_{k+r} W_{ri} + u_i \mid X_{1i}, \ldots, X_{ki}, W_{1i}, \ldots, W_{ri}) \\
&= \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \beta_{k+1} W_{1i} + \cdots + \beta_{k+r} W_{ri} + E(u_i \mid X_{1i}, \ldots, X_{ki}, W_{1i}, \ldots, W_{ri}) \\
&= (\beta_0 + \gamma_0) + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + (\beta_{k+1} + \gamma_1) W_{1i} + \cdots + (\beta_{k+r} + \gamma_r) W_{ri} \\
&= \delta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \delta_1 W_{1i} + \cdots + \delta_r W_{ri},
\end{aligned}
\tag{6.24}
$$

where the first equality uses Equation (6.17), the second equality distributes the conditional
expectation, the third equality uses Equation (6.23), and the fourth equality defines
$\delta_0 = \beta_0 + \gamma_0$ and $\delta_j = \beta_{k+j} + \gamma_j$, $j = 1, \ldots, r$.

It follows from Equation (6.24) that we can rewrite the multiple regression model with
control variables as

$$
Y_i = \delta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \delta_1 W_{1i} + \cdots + \delta_r W_{ri} + v_i, \tag{6.25}
$$

where the error term $v_i$ has a conditional mean of 0: $E(v_i \mid X_{1i}, \ldots, X_{ki}, W_{1i}, \ldots, W_{ri}) = 0$. Thus,
for this rewritten regression, the least squares assumptions in Key Concept 6.4 apply, with the
reinterpretation of the coefficients as being those of Equation (6.24).

Three conclusions follow from the rewritten form of the multiple regression model with
control variables given in Equation (6.25). First, OLS provides unbiased estimators for the $\beta$’s
and $\delta$’s in Equation (6.25), and under the additional assumptions 2 through 4 of Key Concept 6.6, the
OLS estimators are consistent and have a normal distribution in large samples. Second, under
the conditional mean independence assumption, the OLS estimators of the coefficients on the
$X$’s have a causal interpretation; that is, they are unbiased for the causal effects $\beta_1, \ldots, \beta_k$.
Third, the coefficients on the control variables do not, in general, have a causal interpretation.
The reason is that those coefficients estimate any direct causal effect of the control variables,
plus a term (the $\gamma$’s) arising because of correlation between $u_i$ and the control variable. Thus,
under conditional mean independence, the OLS estimators of the coefficients on the control
variables, in general, suffer from omitted variable bias, even though the estimators of the
coefficients on the variables of interest do not.
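
These three conclusions can be illustrated with a simulation. The sketch below assumes one variable of interest $X$ and one control variable $W$, with hypothetical values $\beta_1 = 2$, $\beta_2 = 1$, and $\gamma_1 = -3$; the error is built so that $E(u_i \mid X_i, W_i) = \gamma_1 W_i$, which satisfies conditional mean independence. The helper `ols` is an assumed name for a plain-Python solver of the normal equations.

```python
import random


def ols(y, columns):
    """OLS coefficients [intercept, b1, ..., bk], found by solving the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[j] * r[m] for r in rows) for m in range(p)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda idx: abs(a[idx][c]))
        a[c], a[piv] = a[piv], a[c]
        for idx in range(p):
            if idx != c:
                f = a[idx][c] / a[c][c]
                a[idx] = [v - f * w for v, w in zip(a[idx], a[c])]
    return [a[j][p] / a[j][j] for j in range(p)]


random.seed(1)
n = 20_000
beta1, beta2, gamma1 = 2.0, 1.0, -3.0   # hypothetical population values

w = [random.gauss(0, 1) for _ in range(n)]
x = [0.5 * wi + random.gauss(0, 1) for wi in w]      # X is correlated with W
u = [gamma1 * wi + random.gauss(0, 1) for wi in w]   # E(u | X, W) = gamma1 * W
y = [beta1 * xi + beta2 * wi + ui for xi, wi, ui in zip(x, w, u)]

b_short = ols(y, [x])      # omits the control variable W
b_long = ols(y, [x, w])    # includes W as a control

print(f"coefficient on X without W: {b_short[1]:.3f}")
print(f"coefficient on X with W:    {b_long[1]:.3f}  (beta1 = {beta1})")
print(f"coefficient on W:           {b_long[2]:.3f}  (beta2 + gamma1 = {beta2 + gamma1})")
```

In this design the coefficient on $X$ is close to $\beta_1$ only when $W$ is included, while the coefficient on $W$ is close to $\delta_1 = \beta_2 + \gamma_1$ rather than to $\beta_2$, so it has no causal interpretation.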


Hypothesis Tests and Confidence Intervals
in Multiple Regression

As discussed in Chapter 6, multiple regression analysis provides a way to mitigate
the problem of omitted variable bias by including additional regressors, thereby
controlling for the effects of those additional regressors. The coefficients of the multiple
regression model can be estimated by OLS. Like all estimators, the OLS estimator
has sampling uncertainty because its value differs from one sample to the next.

This chapter presents methods for quantifying the sampling uncertainty of the
OLS estimator through the use of standard errors, statistical hypothesis tests, and
confidence intervals. One new possibility that arises in multiple regression is a
hypothesis that simultaneously involves two or more regression coefficients. The
general approach to testing such “joint” hypotheses involves a new test statistic, the
F-statistic.

Section 7.1 extends the methods for statistical inference in regression with a single
regressor to multiple regression. Sections 7.2 and 7.3 show how to test hypotheses
that involve two or more regression coefficients. Section 7.4 extends the notion of
confidence intervals for a single coefficient to confidence sets for multiple coefficients.
Deciding which variables to include in a regression is an important practical issue, so
Section 7.5 discusses ways to approach this problem. In Section 7.6, we apply multiple
regression analysis to obtain improved estimates of the causal effect on test scores of a
reduction in the student-teacher ratio using the California test score data set.


7.1 Hypothesis Tests and Confidence Intervals
for a Single Coefficient


This section describes how to compute the standard error, how to test hypotheses, 
and how to construct confidence intervals for a single coefficient in a multiple regres- 
sion equation. 


Standard Errors for the OLS Estimators 


Recall that, in the case of a single regressor, it was possible to estimate the variance
of the OLS estimator by substituting sample averages for expectations, which led to
the estimator $\hat{\sigma}^2_{\hat{\beta}_1}$ given in Equation (5.4). Under the least squares assumptions,
the law of large numbers implies that these sample averages converge to their
population counterparts, so, for example, $\hat{\sigma}^2_{\hat{\beta}_1} / \sigma^2_{\hat{\beta}_1} \xrightarrow{p} 1$. The square root of $\hat{\sigma}^2_{\hat{\beta}_1}$ is
the standard error of $\hat{\beta}_1$, $SE(\hat{\beta}_1)$, an estimator of the standard deviation of the
sampling distribution of $\hat{\beta}_1$.

All this extends directly to multiple regression. The OLS estimator $\hat{\beta}_j$ of the $j$th
regression coefficient has a standard deviation, and this standard deviation is estimated
by its standard error, $SE(\hat{\beta}_j)$. The formula for the standard error is best
stated using matrices (see Section 19.2). The important point is that, as far as standard
errors are concerned, there is nothing conceptually different between the
single- and multiple-regressor cases. The key ideas (the large-sample normality of
the estimators and the ability to estimate consistently the standard deviation of
their sampling distribution) are the same whether there are one, two, or a dozen
regressors.


Hypothesis Tests for a Single Coefficient 


Suppose that you want to test the hypothesis that a change in the student-teacher
ratio has no effect on test scores, holding constant the percentage of English learners
in the district. This corresponds to hypothesizing that the true coefficient $\beta_1$ on the
student-teacher ratio is 0 in the population regression of test scores on STR and
PctEL. More generally, we might want to test the hypothesis that the true coefficient
$\beta_j$ on the $j$th regressor takes on some specific value, $\beta_{j,0}$. The null value $\beta_{j,0}$ comes
either from economic theory or, as in the student-teacher ratio example, from the
decision-making context of the application. If the alternative hypothesis is two-sided,
then the two hypotheses can be written mathematically as

$$
H_0\colon \beta_j = \beta_{j,0} \quad \text{vs.} \quad H_1\colon \beta_j \neq \beta_{j,0} \quad \text{(two-sided alternative).} \tag{7.1}
$$

For example, if the first regressor is STR, then the null hypothesis that changing the
student-teacher ratio has no effect on test scores corresponds to the null hypothesis
that $\beta_1 = 0$ (so $\beta_{1,0} = 0$). Our task is to test the null hypothesis $H_0$ against the
alternative $H_1$ using a sample of data.

Key Concept 5.2 gives a procedure for testing this null hypothesis when there is 
a single regressor. The first step in this procedure is to calculate the standard error of 
the coefficient. The second step is to calculate the t-statistic using the general formula 
in Key Concept 5.1. The third step is to compute the p-value of the test using the 
cumulative normal distribution in Appendix Table 1 or, alternatively, to compare 
the t-statistic to the critical value corresponding to the desired significance level of 
the test. The theoretical underpinnings of this procedure are that the OLS estimator 
has a large-sample normal distribution that, under the null hypothesis, has as its mean 
the hypothesized true value and that the variance of this distribution can be esti- 
mated consistently. 

These underpinnings are present in multiple regression as well. As stated in Key
Concept 6.5, the sampling distribution of $\hat{\beta}_j$ is approximately normal. Under the null
hypothesis, the mean of this distribution is $\beta_{j,0}$. The variance of this distribution can
be estimated consistently. Therefore we can simply follow the same procedure as in
the single-regressor case to test the null hypothesis in Equation (7.1).


KEY CONCEPT 7.1
Testing the Hypothesis $\beta_j = \beta_{j,0}$ Against the Alternative $\beta_j \neq \beta_{j,0}$


1. Compute the standard error of $\hat{\beta}_j$, $SE(\hat{\beta}_j)$.


2. Compute the t-statistic:

$$t = \frac{\hat{\beta}_j - \beta_{j,0}}{SE(\hat{\beta}_j)}. \tag{7.2}$$


3. Compute the p-value:

$$p\text{-value} = 2\Phi(-|t^{act}|), \tag{7.3}$$

where $t^{act}$ is the value of the t-statistic actually computed. Reject the hypothesis
at the 5% significance level if the p-value is less than 0.05 or, equivalently, if
$|t^{act}| > 1.96$.


The standard error and (typically) the t-statistic and p-value testing $\beta_j = 0$ are
computed automatically by regression software.


The procedure for testing a hypothesis on a single coefficient in multiple regression
is summarized as Key Concept 7.1. The t-statistic actually computed is denoted
$t^{act}$ in this box. However, it is customary to denote this simply as $t$, and we adopt this
simplified notation for the rest of the book.
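
The three steps of Key Concept 7.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `t_test_single_coefficient`, its arguments, and the example inputs are hypothetical, and the standard normal CDF comes from Python's standard library rather than from Appendix Table 1.

```python
from statistics import NormalDist


def t_test_single_coefficient(beta_hat, se, beta_null=0.0, level=0.05):
    """Two-sided large-sample test of H0: beta_j = beta_null (hypothetical helper)."""
    t_act = (beta_hat - beta_null) / se            # step 2: the t-statistic
    p_value = 2 * NormalDist().cdf(-abs(t_act))    # step 3: p-value = 2 * Phi(-|t|)
    return t_act, p_value, p_value < level


# Illustrative inputs, made up for this sketch (not taken from any regression).
t_act, p_value, reject = t_test_single_coefficient(beta_hat=0.50, se=0.20)
print(f"t = {t_act:.2f}, p-value = {p_value:.4f}, reject at 5%: {reject}")
```

Step 1, the standard error itself, is taken as given here because it is normally reported by regression software.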


Confidence Intervals for a Single Coefficient 


The method for constructing a confidence interval in the multiple regression model 
is also the same as in the single-regressor model. This method is summarized as 
Key Concept 7.2. 

The method for conducting a hypothesis test in Key Concept 7.1 and the method
for constructing a confidence interval in Key Concept 7.2 rely on the large-sample
normal approximation to the distribution of the OLS estimator $\hat{\beta}_j$. Accordingly, it
should be kept in mind that these methods for quantifying the sampling uncertainty
are only guaranteed to work in large samples.


KEY CONCEPT 7.2
Confidence Intervals for a Single Coefficient in Multiple Regression


A 95% two-sided confidence interval for the coefficient $\beta_j$ is an interval that
contains the true value of $\beta_j$ with a 95% probability; that is, it contains the true value
of $\beta_j$ in 95% of all possible randomly drawn samples. Equivalently, it is the set of
values of $\beta_j$ that cannot be rejected by a 5% two-sided hypothesis test. When the
sample size is large, the 95% confidence interval is


$$95\%\ \text{confidence interval for } \beta_j = \left[\hat{\beta}_j - 1.96\,SE(\hat{\beta}_j),\ \hat{\beta}_j + 1.96\,SE(\hat{\beta}_j)\right]. \tag{7.4}$$


A 90% confidence interval is obtained by replacing 1.96 in Equation (7.4)
with 1.64.


Application to Test Scores and the Student-Teacher Ratio


Can we reject the null hypothesis that a change in the student-teacher ratio has no
effect on test scores, once we control for the percentage of English learners in the
district? What is a 95% confidence interval for the effect on test scores of a change
in the student-teacher ratio, controlling for the percentage of English learners? We
are now able to find out. The regression of test scores against STR and PctEL,
estimated by OLS, was given in Equation (6.12) and is restated here with standard
errors in parentheses below the coefficients:


$$\widehat{TestScore} = \underset{(8.7)}{686.0} - \underset{(0.43)}{1.10} \times STR - \underset{(0.031)}{0.650} \times PctEL. \tag{7.5}$$


To test the hypothesis that the true coefficient on STR is 0, we first need to compute
the t-statistic in Equation (7.2). Because the null hypothesis says that the true value
of this coefficient is 0, the t-statistic is $t = (-1.10 - 0)/0.43 = -2.54$. The associated
p-value is $2\Phi(-2.54) = 1.1\%$; that is, the smallest significance level at which
we can reject the null hypothesis is 1.1%. Because the p-value is less than 5%, the
null hypothesis can be rejected at the 5% significance level (but not quite at the 1%
significance level).

A 95% confidence interval for the population coefficient on STR is
$-1.10 \pm 1.96 \times 0.43 = (-1.95, -0.26)$; that is, we can be 95% confident that the
true value of the coefficient is between $-1.95$ and $-0.26$. Interpreted in the context
of the superintendent’s interest in decreasing the student-teacher ratio by 2,
the 95% confidence interval for the effect on test scores of this reduction is
$(-0.26 \times -2,\ -1.95 \times -2) = (0.52, 3.90)$.
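
The interval arithmetic above can be checked with a short Python sketch. The helper name `confidence_interval` is hypothetical. Because the inputs are the rounded coefficient and standard error printed in Equation (7.5), the computed endpoints may differ from those quoted in the text in the last digit, presumably because the text works from unrounded estimates.

```python
def confidence_interval(beta_hat, se, critical_value=1.96):
    """Large-sample two-sided confidence interval: beta_hat +/- critical_value * SE."""
    return beta_hat - critical_value * se, beta_hat + critical_value * se


# Rounded coefficient on STR and its standard error, as printed in Equation (7.5).
lower, upper = confidence_interval(-1.10, 0.43)

# Effect of changing STR by -2: scale both endpoints, then restore their order,
# because multiplying by a negative number flips the interval.
change = -2
effect_low, effect_high = sorted((lower * change, upper * change))

print(f"95% CI for the STR coefficient: ({lower:.2f}, {upper:.2f})")
print(f"95% CI for the effect of the change: ({effect_low:.2f}, {effect_high:.2f})")
```

Passing `critical_value=1.64` instead gives the corresponding 90% interval.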


Adding expenditures per pupil to the equation. Your analysis of the multiple regression 
in Equation (7.5) has persuaded the superintendent that, based on the evidence so far, 
reducing class size will improve test scores in her district. Now, however, she moves on 
to a more nuanced question. If she is to hire more teachers, she can pay for those teach- 
ers either by making cuts elsewhere in the budget (no new computers, reduced mainte- 
nance, and so on) or by asking for an increase in her budget, which taxpayers do not 
favor. What, she asks, is the effect on test scores of reducing the student-teacher ratio, 
holding expenditures per pupil (and the percentage of English learners) constant?

This question can be addressed by estimating a regression of test scores on the
student-teacher ratio, total spending per pupil, and the percentage of English
learners. The OLS regression line is


$$\widehat{TestScore} = \underset{(15.5)}{649.6} - \underset{(0.48)}{0.29} \times STR + \underset{(1.59)}{3.87} \times Expn - \underset{(0.032)}{0.656} \times PctEL, \tag{7.6}$$


where Expn is total annual expenditures per pupil in the district in thousands of 
dollars. 

The result is striking. Holding expenditures per pupil and the percentage of
English learners constant, changing the student-teacher ratio is estimated to have a
very small effect on test scores: The estimated coefficient on STR is $-1.10$ in
Equation (7.5), but after adding Expn as a regressor in Equation (7.6), it is only
$-0.29$. Moreover, the t-statistic for testing that the true value of the coefficient is 0 is
now $t = (-0.29 - 0)/0.48 = -0.60$, so the hypothesis that the population value of
this coefficient is indeed 0 cannot be rejected even at the 10% significance level
($|-0.60| < 1.64$). Thus Equation (7.6) provides no evidence that hiring more teachers
improves test scores if overall expenditures per pupil are held constant.

One interpretation of the regression in Equation (7.6) is that, in these California data,
school administrators allocate their budgets efficiently. Suppose, counterfactually, that the 
coefficient on STR in Equation (7.6) were negative and large. If so, school districts could 
raise their test scores simply by decreasing funding for other purposes (textbooks, tech- 
nology, sports, and so on) and using those funds to hire more teachers, thereby reducing 
class sizes while holding expenditures constant. However, the small and statistically insig- 
nificant coefficient on STR in Equation (7.6) indicates that this transfer would have little 
effect on test scores. Put differently, districts are already allocating their funds efficiently. 

Note that the standard error on STR increased when Expn was added, from 0.43
in Equation (7.5) to 0.48 in Equation (7.6). This illustrates the general point,
introduced in Section 6.7 in the context of imperfect multicollinearity, that correlation
between regressors (the correlation between STR and Expn is $-0.62$) can make the
OLS estimators less precise.

What about our angry taxpayer? He asserts that the population values of both
the coefficient on the student-teacher ratio ($\beta_1$) and the coefficient on spending per
pupil ($\beta_2$) are 0; that is, he hypothesizes that both $\beta_1 = 0$ and $\beta_2 = 0$. Although it
might seem that we can reject this hypothesis because the t-statistic testing $\beta_2 = 0$ in
Equation (7.6) is $t = 3.87/1.59 = 2.43$, this reasoning is flawed. The taxpayer’s
hypothesis is a joint hypothesis, and to test it we need a new tool, the F-statistic.


Tests of Joint Hypotheses 


This section describes how to formulate joint hypotheses on multiple regression 
coefficients and how to test them using an F-statistic. 


CHAPTER7 Hypothesis Tests and Confidence Intervals in Multiple Regression 


Testing Hypotheses on Two or More Coefficients 


Joint null hypotheses. Consider the regression in Equation (7.6) of the test score
against the student-teacher ratio, expenditures per pupil, and the percentage of
English learners. Our angry taxpayer hypothesizes that neither the student-teacher
ratio nor expenditures per pupil have an effect on test scores, once we control for the
percentage of English learners. Because STR is the first regressor in Equation (7.6)
and Expn is the second, we can write this hypothesis mathematically as


$$H_0: \beta_1 = 0 \text{ and } \beta_2 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0 \text{ and/or } \beta_2 \neq 0. \tag{7.7}$$


The hypothesis that both the coefficient on the student-teacher ratio ($\beta_1$) and
the coefficient on expenditures per pupil ($\beta_2$) are 0 is an example of a joint hypothesis
on the coefficients in the multiple regression model. In this case, the null hypothesis
restricts the value of two of the coefficients, so as a matter of terminology we can say
that the null hypothesis in Equation (7.7) imposes two restrictions on the multiple
regression model: $\beta_1 = 0$ and $\beta_2 = 0$.

In general, a joint hypothesis is a hypothesis that imposes two or more restrictions
on the regression coefficients. We consider joint null and alternative hypotheses
of the form


$$\begin{aligned} H_0 &: \beta_j = \beta_{j,0},\ \beta_m = \beta_{m,0},\ \ldots, \text{ for a total of } q \text{ restrictions, vs.} \\ H_1 &: \text{one or more of the } q \text{ restrictions under } H_0 \text{ does not hold,} \end{aligned} \tag{7.8}$$


where $\beta_j, \beta_m, \ldots,$ refer to different regression coefficients and $\beta_{j,0}, \beta_{m,0}, \ldots,$ refer to
the values of these coefficients under the null hypothesis. The null hypothesis in
Equation (7.7) is an example of Equation (7.8). Another example is that, in a regression
with $k = 6$ regressors, the null hypothesis is that the coefficients on the second,
fourth, and fifth regressors are 0; that is, $\beta_2 = 0$, $\beta_4 = 0$, and $\beta_5 = 0$, so that there are
$q = 3$ restrictions. In general, under the null hypothesis $H_0$, there are $q$ such
restrictions.

If at least one of the equalities comprising the null hypothesis $H_0$ in Equation (7.8)
is false, then the joint null hypothesis itself is false. Thus the alternative hypothesis is
that at least one of the equalities in the null hypothesis $H_0$ does not hold.


Why can’t I just test the individual coefficients one at a time? Although it seems it
should be possible to test a joint hypothesis by using the usual t-statistics to test the
restrictions one at a time, the following calculation shows that this approach is
unreliable. Specifically, suppose you are interested in testing the joint null hypothesis in
Equation (7.7) that $\beta_1 = 0$ and $\beta_2 = 0$. Let $t_1$ be the t-statistic for testing the null
hypothesis that $\beta_1 = 0$, and let $t_2$ be the t-statistic for testing the null hypothesis that
$\beta_2 = 0$. What happens when you use the “one-at-a-time” testing procedure: Reject
the joint null hypothesis if either $t_1$ or $t_2$ exceeds 1.96 in absolute value?

Because this question involves the two random variables $t_1$ and $t_2$, answering it
requires characterizing the joint sampling distribution of $t_1$ and $t_2$. As mentioned in
Section 6.6, in large samples, $\hat{\beta}_1$ and $\hat{\beta}_2$ have a joint normal distribution, so under the
joint null hypothesis the t-statistics $t_1$ and $t_2$ have a bivariate normal distribution,
where each t-statistic has a mean equal to 0 and variance equal to 1.

First, consider the special case in which the t-statistics are uncorrelated and thus
are independent in large samples. What is the size of the one-at-a-time testing
procedure; that is, what is the probability that you will reject the null hypothesis when it is
true? More than 5%! In this special case, we can calculate the rejection probability
of this method exactly. The null is not rejected only if both $|t_1| \le 1.96$ and $|t_2| \le 1.96$.
Because the t-statistics are independent, $\Pr(|t_1| \le 1.96 \text{ and } |t_2| \le 1.96) =
\Pr(|t_1| \le 1.96) \times \Pr(|t_2| \le 1.96) = 0.95^2 = 0.9025 = 90.25\%$. So the probability
of rejecting the null hypothesis when it is true is $1 - 0.95^2 = 9.75\%$. This one-at-a-time
method rejects the null too often because it gives you too many chances: If you
fail to reject using the first t-statistic, you get to try again using the second.

If the regressors are correlated, the situation is more complicated. The size of the 
one-at-a-time procedure depends on the value of the correlation between the regres- 
sors. Because the one-at-a-time testing approach has the wrong size — that is, its rejec- 
tion rate under the null hypothesis does not equal the desired significance level—a 
new approach is needed. 
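
The size calculation can be illustrated numerically. The sketch below is an assumption-laden illustration, not a procedure from the text: it computes the exact size for independent t-statistics and then estimates the size by Monte Carlo when the two t-statistics are jointly standard normal with a chosen correlation `rho`. The function names, the seed, and the number of draws are all hypothetical choices, and the correlation is imposed directly on the t-statistics rather than derived from correlated regressors.

```python
import random


def size_if_independent(q, level=0.05):
    """Exact rejection probability under H0 with q independent t-statistics."""
    return 1 - (1 - level) ** q


def simulated_size(rho, draws=100_000, critical_value=1.96, seed=0):
    """Monte Carlo size of the one-at-a-time test when corr(t1, t2) = rho."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(draws):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        t1 = z1
        # t2 is standard normal with correlation rho to t1.
        t2 = rho * z1 + (1 - rho ** 2) ** 0.5 * z2
        if abs(t1) > critical_value or abs(t2) > critical_value:
            rejections += 1
    return rejections / draws


print(f"independent case, exact: {size_if_independent(2):.4f}")
for rho in (0.0, 0.5, 0.9):
    print(f"rho = {rho}: simulated size = {simulated_size(rho):.4f}")
```

In this setup the simulated size stays above 5% for every correlation short of perfect, moving from about 9.75% toward 5% as the t-statistics become more strongly correlated.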

One approach is to modify the one-at-a-time method so that it uses different 
critical values that ensure that its size equals its significance level. This method, called 
the Bonferroni method, is described in Appendix 7.1. The advantage of the Bonferroni 
method is that it applies very generally. Its disadvantage is that it can have low power: 
It frequently fails to reject the null hypothesis when, in fact, the alternative hypoth- 
esis is true. 
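
The details of the Bonferroni method are in Appendix 7.1 and are not reproduced here. As a minimal sketch, assuming the standard Bonferroni rule of testing each of the q coefficients at level α/q (which keeps the size of the joint test no larger than α), the adjusted two-sided critical value can be computed as follows. The function name is hypothetical.

```python
from statistics import NormalDist


def bonferroni_critical_value(q, level=0.05):
    """Two-sided normal critical value applied to each of q t-statistics."""
    return NormalDist().inv_cdf(1 - level / (2 * q))


for q in (1, 2, 3, 4):
    print(f"q = {q}: critical value = {bonferroni_critical_value(q):.3f}")
```

With q = 1 this reduces to the familiar 1.96; the critical value grows with q, which is one way to see why the method can have low power.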

Fortunately, there is another approach to testing joint hypotheses that is more 
powerful, especially when the regressors are highly correlated. That approach is 
based on the F-statistic. 


The F-Statistic 


The F-statistic is used to test a joint hypothesis about regression coefficients. The 
formulas for the F-statistic are integrated into modern regression software. We first 
discuss the case of two restrictions, then turn to the general case of q restrictions.


The F-statistic with q = 2 restrictions. When the joint null hypothesis has the two
restrictions that $\beta_1 = 0$ and $\beta_2 = 0$, the F-statistic combines the two t-statistics $t_1$ and
$t_2$ using the formula


$$F = \frac{1}{2}\left(\frac{t_1^2 + t_2^2 - 2\hat{\rho}_{t_1,t_2}\, t_1 t_2}{1 - \hat{\rho}_{t_1,t_2}^2}\right), \tag{7.9}$$


where $\hat{\rho}_{t_1,t_2}$ is an estimator of the correlation between the two t-statistics.

To understand the F-statistic in Equation (7.9), first suppose we know that the
t-statistics are uncorrelated, so we can drop the terms involving $\hat{\rho}_{t_1,t_2}$. If so, Equation
(7.9) simplifies, and $F = \frac{1}{2}(t_1^2 + t_2^2)$; that is, the F-statistic is the average of the
squared t-statistics. Under the null hypothesis, $t_1$ and $t_2$ are independent standard
normal random variables (because the t-statistics are uncorrelated by assumption),
so under the null hypothesis F has an $F_{2,\infty}$ distribution (Section 2.4). Under the
alternative hypothesis that either $\beta_1$ is nonzero or $\beta_2$ is nonzero (or both), then either $t_1^2$
or $t_2^2$ (or both) will be large, leading the test to reject the null hypothesis.

In general, the t-statistics are correlated, and the formula for the F-statistic in
Equation (7.9) adjusts for this correlation. This adjustment is made so that under the
null hypothesis the F-statistic has an $F_{2,\infty}$ distribution in large samples whether or not
the t-statistics are correlated.
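
To see how the correlation adjustment works, the following Python sketch evaluates the two-restriction F-statistic, taken here as half of (t1² + t2² − 2ρ·t1·t2) divided by (1 − ρ²), for a few values of the correlation. The function name and the t-statistics are hypothetical, chosen only for illustration.

```python
def f_stat_two_restrictions(t1, t2, rho):
    """F-statistic for q = 2 restrictions built from two t-statistics."""
    if not -1 < rho < 1:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 0.5 * (t1 ** 2 + t2 ** 2 - 2 * rho * t1 * t2) / (1 - rho ** 2)


# Hypothetical t-statistics, used only to show the effect of the correlation.
t1, t2 = 1.5, 2.0
for rho in (-0.5, 0.0, 0.5):
    print(f"rho = {rho:+.1f}: F = {f_stat_two_restrictions(t1, t2, rho):.3f}")
```

With `rho = 0` the result is simply the average of the squared t-statistics; the same pair of t-statistics yields a larger or smaller F depending on the sign and size of the correlation.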


The F-statistic with q restrictions. The formula for the heteroskedasticity-robust 
F-statistic testing the q restrictions of the joint null hypothesis in Equation (7.8) is 
given in Section 19.3. This formula is incorporated into regression software, making 
the F-statistic easy to compute in practice. 

Under the null hypothesis, the F-statistic has a sampling distribution that, in large
samples, is given by the $F_{q,\infty}$ distribution. That is, in large samples, under the null
hypothesis


$$\text{the } F\text{-statistic is distributed } F_{q,\infty}. \tag{7.10}$$


Thus the critical values for the F-statistic can be obtained from the tables of the $F_{q,\infty}$
distribution in Appendix Table 4 for the appropriate value of q and the desired
significance level.


Computing the heteroskedasticity-robust F-statistic in statistical software. If the 
F-statistic is computed using the general heteroskedasticity-robust formula, its large-n 
distribution under the null hypothesis is $F_{q,\infty}$ regardless of whether the errors are
homoskedastic or heteroskedastic. As discussed in Section 5.4, for historical reasons, 
most statistical software computes homoskedasticity-only standard errors by default. 
Consequently, in some software packages you must select a “robust” option so that the 
F-statistic is computed using heteroskedasticity-robust standard errors (and, more 
generally, a heteroskedasticity-robust estimate of the “covariance matrix”). The 
homoskedasticity-only version of the F-statistic is discussed at the end of this section. 


Computing the p-value using the F-statistic. The p-value of the F-statistic can be
computed using the large-sample $F_{q,\infty}$ approximation to its distribution. Let $F^{act}$
denote the value of the F-statistic actually computed. Because the F-statistic has a
large-sample $F_{q,\infty}$ distribution under the null hypothesis, the p-value is


$$p\text{-value} = \Pr[F_{q,\infty} > F^{act}]. \tag{7.11}$$


The p-value in Equation (7.11) can be evaluated using a table of the $F_{q,\infty}$ distribution
(or, alternatively, a table of the $\chi^2_q$ distribution because a $\chi^2_q$-distributed random
variable is q times an $F_{q,\infty}$-distributed random variable). Alternatively, the
p-value can be evaluated using a computer because formulas for the cumulative
chi-squared and F distributions have been incorporated into most modern statistical
software.
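
As a sketch of how such a computation might look without a statistics package, the Python code below evaluates the large-sample p-value through the chi-squared link just described: the probability that an F with q and infinite degrees of freedom exceeds a value equals the probability that a chi-squared with q degrees of freedom exceeds q times that value. The helper names `chi2_sf` and `f_p_value` are hypothetical, and the chi-squared tail is computed with a standard recurrence for the incomplete gamma function, valid for positive integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Pr(chi-squared with df degrees of freedom > x), for a positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2
    # Start from the closed form for df = 2 (even df) or df = 1 (odd df) ...
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-half)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(half))
    # ... then add one recurrence term for each additional 2 degrees of freedom.
    while a < df / 2:
        tail += half ** a * math.exp(-half) / math.gamma(a + 1)
        a += 1
    return tail


def f_p_value(f_act, q):
    """Large-sample p-value: Pr(F with (q, infinity) degrees of freedom > f_act)."""
    return chi2_sf(q * f_act, q)


print(f"q = 1, F = 1.96^2: p = {f_p_value(1.96 ** 2, 1):.4f}")
print(f"q = 2, F = 3.00:   p = {f_p_value(3.00, 2):.4f}")
print(f"q = 2, F = 4.61:   p = {f_p_value(4.61, 2):.4f}")
```

The example inputs are the usual 5% and 1% critical values, so the printed p-values should land close to 0.05 and 0.01.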


The overall regression F-statistic. The overall regression F-statistic tests the joint
hypothesis that all the slope coefficients are 0. That is, the null and alternative
hypotheses are


$$H_0: \beta_1 = 0,\ \beta_2 = 0,\ \ldots,\ \beta_k = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0, \text{ at least one } j,\ j = 1, \ldots, k. \tag{7.12}$$


Under this null hypothesis, none of the regressors explains any of the variation in $Y_i$,
although the intercept (which under the null hypothesis is the mean of $Y_i$) can be
nonzero. The null hypothesis in Equation (7.12) is a special case of the general null
hypothesis in Equation (7.8), and the overall regression F-statistic is the F-statistic
computed for the null hypothesis in Equation (7.12). In large samples, the overall
regression F-statistic has an $F_{k,\infty}$ distribution when the null hypothesis is true.


The F-statistic when q = 1. When q = 1, the F-statistic tests a single restriction. 
Then the joint null hypothesis reduces to the null hypothesis on a single regression 
coefficient, and the F-statistic is the square of the t-statistic. 


Application to Test Scores 
and the Student-Teacher Ratio 


We are now able to test the null hypothesis that the coefficients on both the student- 
teacher ratio and expenditures per pupil are 0 against the alternative that at least one 
coefficient is nonzero, controlling for the percentage of English learners in the 
district. 

To test this hypothesis, we need to compute the heteroskedasticity-robust
F-statistic testing the null hypothesis that $\beta_1 = 0$ and $\beta_2 = 0$ using the regression of
TestScore on STR, Expn, and PctEL reported in Equation (7.6). This F-statistic is
5.43. Under the null hypothesis, in large samples this statistic has an $F_{2,\infty}$ distribution.
The 5% critical value of the $F_{2,\infty}$ distribution is 3.00 (Appendix Table 4), and the 1%
critical value is 4.61. The value of the F-statistic computed from the data, 5.43, exceeds
4.61, so the null hypothesis is rejected at the 1% level. It is very unlikely that we
would have drawn a sample that produced an F-statistic as large as 5.43 if the null
hypothesis really were true (the p-value is 0.005). Based on the evidence in
Equation (7.6) as summarized in this F-statistic, we can reject the taxpayer’s hypothesis
that neither the student-teacher ratio nor expenditures per pupil have an effect
on test scores (holding constant the percentage of English learners).


The Homoskedasticity-Only F-Statistic


One way to restate the question addressed by the F-statistic is to ask whether relaxing
the q restrictions that constitute the null hypothesis improves the fit of the regression
by enough that this improvement is unlikely to be the result merely of random
sampling variation if the null hypothesis is true. This restatement suggests that there is a
link between the F-statistic and the regression $R^2$: A large F-statistic should, it seems,
be associated with a substantial increase in the $R^2$. In fact, if the error $u_i$ is homoskedastic,
this intuition has an exact mathematical expression. Specifically, if the error term is
homoskedastic, the F-statistic can be written in terms of the improvement in the fit of
the regression as measured either by the decrease in the sum of squared residuals or
by the increase in the regression $R^2$. The resulting F-statistic is referred to as the
homoskedasticity-only F-statistic because it is valid only if the error term is
homoskedastic. In contrast, the heteroskedasticity-robust F-statistic computed using the
formula in Section 19.3 (and reported above) is valid whether the error term is
homoskedastic or heteroskedastic. Despite this significant limitation of the
homoskedasticity-only F-statistic, its simple formula sheds light on what the F-statistic is doing. In addition,
the simple formula can be computed using standard regression output, such as might
be reported in a table that includes regression $R^2$s but not F-statistics.

The homoskedasticity-only F-statistic is computed using a simple formula based 
on the sum of squared residuals from two regressions. In the first regression, called 
the restricted regression, the null hypothesis is forced to be true. When the null 
hypothesis is of the type in Equation (7.8), where all the hypothesized values are 0, 
the restricted regression is the regression in which those coefficients are set to 0; that 
is, the relevant regressors are excluded from the regression. In the second regression, 
called the unrestricted regression, the alternative hypothesis is allowed to be true. If 
the sum of squared residuals is sufficiently smaller in the unrestricted than in the 
restricted regression, then the test rejects the null hypothesis. 

The homoskedasticity-only F-statistic is given by the formula


$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)}, \tag{7.13}$$


where $SSR_{restricted}$ is the sum of squared residuals from the restricted regression,
$SSR_{unrestricted}$ is the sum of squared residuals from the unrestricted regression, q is the
number of restrictions under the null hypothesis, and $k_{unrestricted}$ is the number of
regressors in the unrestricted regression. An alternative equivalent formula for the
homoskedasticity-only F-statistic is based on the $R^2$ of the two regressions:


$$F = \frac{(R^2_{unrestricted} - R^2_{restricted})/q}{(1 - R^2_{unrestricted})/(n - k_{unrestricted} - 1)}. \tag{7.14}$$


If the errors are homoskedastic, then the difference between the homoskedasticity-only
F-statistic computed using Equation (7.13) or (7.14) and the heteroskedasticity-robust
F-statistic vanishes as the sample size n increases. Thus, if the errors are
homoskedastic, the sampling distribution of the homoskedasticity-only F-statistic
under the null hypothesis is, in large samples, $F_{q,\infty}$.

These formulas are easy to compute and have an intuitive interpretation in terms
of how well the unrestricted and restricted regressions fit the data. Unfortunately, the
formulas apply only if the errors are homoskedastic. Because homoskedasticity is a
special case that cannot be counted on in applications with economic data (or, more
generally, with data sets typically found in the social sciences), in practice the
homoskedasticity-only F-statistic is not a satisfactory substitute for the
heteroskedasticity-robust F-statistic.


Using the homoskedasticity-only F-statistic when n is small. If the errors are i.i.d.,
homoskedastic, and normally distributed, then the homoskedasticity-only F-statistic
defined in Equations (7.13) and (7.14) has an $F_{q,\,n-k_{unrestricted}-1}$ distribution under the
null hypothesis (see Section 19.4). Critical values for this distribution, which depend
on both q and $n - k_{unrestricted} - 1$, are given in Appendix Table 5. As discussed in
Section 2.4, the $F_{q,\,n-k_{unrestricted}-1}$ distribution converges to the $F_{q,\infty}$ distribution as n
increases; for large sample sizes, the differences between the two distributions are
negligible. For small samples, however, the two sets of critical values differ.


Application to test scores and the student-teacher ratio. To test the null hypothesis
that the population coefficients on STR and Expn are 0, controlling for PctEL, we
need to compute the $R^2$ (or SSR) for the restricted and unrestricted regressions. The
unrestricted regression has the regressors STR, Expn, and PctEL and is given in
Equation (7.6). Its $R^2$ is 0.4366; that is, $R^2_{unrestricted} = 0.4366$. The restricted regression
imposes the joint null hypothesis that the true coefficients on STR and Expn are 0;
that is, under the null hypothesis STR and Expn do not enter the population
regression, although PctEL does (the null hypothesis does not restrict the coefficient on
PctEL). The restricted regression, estimated by OLS, is


$$\widehat{TestScore} = \underset{(1.0)}{664.7} - \underset{(0.032)}{0.671} \times PctEL, \quad R^2 = 0.4149, \tag{7.15}$$


so $R^2_{restricted} = 0.4149$. The number of restrictions is $q = 2$, the number of observations
is $n = 420$, and the number of regressors in the unrestricted regression is $k = 3$. The
homoskedasticity-only F-statistic, computed using Equation (7.14), is


$$F = \frac{(0.4366 - 0.4149)/2}{(1 - 0.4366)/(420 - 3 - 1)} = 8.01.$$


Because 8.01 exceeds the 1% critical value of 4.61, the hypothesis is rejected at the
1% level using the homoskedasticity-only test.
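
The calculation can be reproduced with a short Python sketch covering both forms of the homoskedasticity-only F-statistic, the one based on sums of squared residuals and the one based on the two $R^2$ values. The function names are hypothetical. The text does not report the sums of squared residuals, so the SSR version is fed values implied by the two $R^2$s under an arbitrary total sum of squares, which is an assumption made only to show that the two forms agree.

```python
def f_homoskedastic_from_r2(r2_unrestricted, r2_restricted, q, n, k_unrestricted):
    """Homoskedasticity-only F-statistic from the R-squared of the two regressions."""
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1 - r2_unrestricted) / (n - k_unrestricted - 1)
    return numerator / denominator


def f_homoskedastic_from_ssr(ssr_restricted, ssr_unrestricted, q, n, k_unrestricted):
    """Homoskedasticity-only F-statistic from the sums of squared residuals."""
    numerator = (ssr_restricted - ssr_unrestricted) / q
    denominator = ssr_unrestricted / (n - k_unrestricted - 1)
    return numerator / denominator


# Values reported in the text: the two R-squared values, n = 420, k = 3, q = 2.
f_r2 = f_homoskedastic_from_r2(0.4366, 0.4149, q=2, n=420, k_unrestricted=3)

# SSR = TSS * (1 - R^2); the TSS scale cancels, so any positive value works.
tss = 1.0  # arbitrary (assumed) total sum of squares, not reported in the text
f_ssr = f_homoskedastic_from_ssr(tss * (1 - 0.4149), tss * (1 - 0.4366),
                                 q=2, n=420, k_unrestricted=3)

print(f"F from R^2: {f_r2:.2f}")
print(f"F from SSR: {f_ssr:.2f}")
```

Both calls should agree to rounding error, since the two formulas are algebraically equivalent.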

This example illustrates the advantages and disadvantages of the homoskedasticity-only
F-statistic. An advantage is that it can be computed using a calculator. Its main
disadvantage is that the values of the homoskedasticity-only and heteroskedasticity-robust
F-statistics can be very different: The heteroskedasticity-robust F-statistic
testing this joint hypothesis is 5.43, quite different from the less reliable
homoskedasticity-only value of 8.01.


Testing Single Restrictions Involving 
Multiple Coefficients 


Sometimes economic theory suggests a single restriction that involves two or more
regression coefficients. For example, theory might suggest a null hypothesis of the
form $\beta_1 = \beta_2$; that is, the effects of the first and second regressors are the same. In
this case, the task is to test this null hypothesis against the alternative that the two
coefficients differ:


$$H_0: \beta_1 = \beta_2 \quad \text{vs.} \quad H_1: \beta_1 \neq \beta_2. \tag{7.16}$$


This null hypothesis has a single restriction, so $q = 1$, but that restriction involves
multiple coefficients ($\beta_1$ and $\beta_2$). We need to modify the methods presented so far to test
this hypothesis. There are two approaches; which is easier depends on your software.


Approach 1: Test the restriction directly. Some statistical packages have a specialized
command designed to test restrictions like Equation (7.16), and the result is an
F-statistic that, because $q = 1$, has an $F_{1,\infty}$ distribution under the null hypothesis.
(Recall from Section 2.4 that the square of a standard normal random variable has
an $F_{1,\infty}$ distribution, so the 95th percentile of the $F_{1,\infty}$ distribution is $1.96^2 = 3.84$.)


Approach 2: Transform the regression. If your statistical package cannot test the
restriction directly, the hypothesis in Equation (7.16) can be tested using a trick in which the
original regression equation is rewritten to turn the restriction in Equation (7.16) into a
restriction on a single regression coefficient. To be concrete, suppose there are only two
regressors, $X_{1i}$ and $X_{2i}$, in the regression, so the population regression has the form


$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i. \tag{7.17}$$


Here is the trick: By subtracting and adding $\beta_2 X_{1i}$, we have that
$\beta_1 X_{1i} + \beta_2 X_{2i} = \beta_1 X_{1i} - \beta_2 X_{1i} + \beta_2 X_{1i} + \beta_2 X_{2i} = (\beta_1 - \beta_2) X_{1i} + \beta_2 (X_{1i} + X_{2i}) = \gamma_1 X_{1i} + \beta_2 V_i$,
where $\gamma_1 = \beta_1 - \beta_2$ and $V_i = X_{1i} + X_{2i}$. Thus the population regression in Equation
(7.17) can be rewritten as


$$Y_i = \beta_0 + \gamma_1 X_{1i} + \beta_2 V_i + u_i. \tag{7.18}$$


Because the coefficient $\gamma_1$ in this equation is $\gamma_1 = \beta_1 - \beta_2$, under the null hypothesis
in Equation (7.16) $\gamma_1 = 0$, while under the alternative $\gamma_1 \neq 0$. Thus, by turning
Equation (7.17) into Equation (7.18), we have turned a restriction on two regression
coefficients into a restriction on a single regression coefficient.


Because the restriction now involves the single coefficient $\gamma_1$, the null hypothesis
in Equation (7.16) can be tested using the t-statistic method of Section 7.1. In practice,
this is done by first constructing the new regressor $V_i$ as the sum of the two original
regressors, then estimating the regression of $Y_i$ on $X_{1i}$ and $V_i$. A 95% confidence
interval for the difference in the coefficients $\beta_1 - \beta_2$ can be calculated as
$\hat{\gamma}_1 \pm 1.96\,SE(\hat{\gamma}_1)$.

This method can be extended to other restrictions on regression equations using 
the same trick (see Exercise 7.9). 

The two methods (approaches 1 and 2) are equivalent in the sense that the 
F-statistic from the first method equals the square of the t-statistic from the second 
method. 
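
The algebra behind Approach 2 can be checked numerically. The sketch below is an illustration on simulated data, not an analysis from the text: the data-generating process, the sample size, the seed, and the small `ols` helper (which solves the normal equations in pure Python) are all assumptions made for this example. It verifies only the coefficient identity, namely that the OLS coefficient on the first regressor in the transformed regression equals the difference of the two OLS coefficients in the original regression; it does not compute standard errors.

```python
import random


def ols(columns, y):
    """OLS coefficients (intercept first) from the normal equations X'X b = X'y."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(k):
            if r != c:
                factor = m[r][c] / m[c][c]
                m[r] = [x - factor * p for x, p in zip(m[r], m[c])]
    return [m[i][k] / m[i][i] for i in range(k)]


# Simulated data (made up for this sketch).
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]
y = [1.0 + 2.0 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

b0, b1, b2 = ols([x1, x2], y)            # original regression on X1 and X2
v = [a + b for a, b in zip(x1, x2)]      # new regressor V = X1 + X2
g0, gamma1, b2_v = ols([x1, v], y)       # transformed regression on X1 and V

print(f"b1 - b2 = {b1 - b2:.6f}")
print(f"gamma1  = {gamma1:.6f}")
```

The two printed numbers should agree up to floating-point error, and the coefficient on V should match the original coefficient on the second regressor.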


Extension to q > 1. In general, it is possible to have q restrictions under the null 
hypothesis in which some or all of these restrictions involve multiple coefficients. The 
F-statistic of Section 7.2 extends to this type of joint hypothesis. The F-statistic can 
be computed by either of the two methods just discussed for $q = 1$. Precisely how
best to do this in practice depends on the specific regression software being used. 


Confidence Sets for Multiple Coefficients 


This section explains how to construct a confidence set for two or more regression 
coefficients. The method is conceptually similar to the method in Section 7.1 for 
constructing a confidence set for a single coefficient using the t-statistic except that 
the confidence set for multiple coefficients is based on the F-statistic. 

A 95% confidence set for two or more coefficients is a set that contains the true 
population values of these coefficients in 95% of randomly drawn samples. Thus a 
confidence set is the generalization to two or more coefficients of a confidence inter- 
val for a single coefficient. 

Recall that a 95% confidence interval is computed by finding the set of values of
the coefficients that are not rejected using a t-statistic at the 5% significance level.
This approach can be extended to the case of multiple coefficients. To make this
concrete, suppose you are interested in constructing a confidence set for two
coefficients, $\beta_1$ and $\beta_2$. Section 7.2 showed how to use the F-statistic to test a joint null
hypothesis that $\beta_1 = \beta_{1,0}$ and $\beta_2 = \beta_{2,0}$. Suppose you were to test every possible
value of $\beta_{1,0}$ and $\beta_{2,0}$ at the 5% level. For each pair of candidates $(\beta_{1,0}, \beta_{2,0})$, you
compute the F-statistic and reject it if it exceeds the 5% critical value of 3.00. Because
the test has a 5% significance level, the true population values of $\beta_1$ and $\beta_2$ will not
be rejected in 95% of all samples. Thus the set of values not rejected at the 5% level
by this F-statistic constitutes a 95% confidence set for $\beta_1$ and $\beta_2$.

Although this method of trying all possible values of $\beta_{1,0}$ and $\beta_{2,0}$ works in theory,
in practice it is much simpler to use an explicit formula for the confidence set. This
formula for the confidence set for an arbitrary number of coefficients is obtained
using the formula for the F-statistic given in Section 19.3. When there are two
coefficients, the resulting confidence sets are ellipses.
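
The "test every candidate pair" idea can be sketched directly in Python for the two-coefficient case, using the two-restriction F-statistic built from the t-statistics evaluated at each candidate pair. The coefficients and standard errors are the rounded values from Equation (7.6). The correlation `rho` between the two estimators is not reported in the text, so the value 0.5 below is purely an assumption for illustration, as are the function names and the candidate points; a different `rho` would change the shape of the set.

```python
def f_stat_at(candidate, estimates, std_errors, rho):
    """Two-restriction F-statistic with t-statistics evaluated at the candidate values."""
    t1, t2 = ((est - null) / se
              for est, null, se in zip(estimates, candidate, std_errors))
    return 0.5 * (t1 ** 2 + t2 ** 2 - 2 * rho * t1 * t2) / (1 - rho ** 2)


def in_confidence_set(candidate, estimates, std_errors, rho, critical_value=3.00):
    """True if the candidate pair is not rejected at the 5% level (q = 2)."""
    return f_stat_at(candidate, estimates, std_errors, rho) <= critical_value


# Rounded coefficients and standard errors on STR and Expn from Equation (7.6).
estimates = (-0.29, 3.87)
std_errors = (0.48, 1.59)
rho = 0.5  # ASSUMED correlation between the two estimators; not given in the text

for candidate in [(0.0, 0.0), (-0.29, 3.87), (-1.0, 2.0), (0.5, 6.0)]:
    print(candidate, in_confidence_set(candidate, estimates, std_errors, rho))
```

The point estimates themselves always lie in the set, because the F-statistic evaluated there is zero; the boundary where the F-statistic equals the critical value traces out the ellipse.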

As an illustration, Figure 7.1 shows a 95% confidence set (confidence ellipse) for 
the coefficients on the student-teacher ratio and expenditures per pupil, holding 
constant the percentage of English learners, based on the estimated regression in 
Equation (7.6). This ellipse does not include the point (0, 0). This means that the null 
hypothesis that these two coefficients are both 0 is rejected using the F-statistic at the 
5% significance level, which we already knew from Section 7.2. The confidence 
ellipse is a fat sausage with the long part of the sausage oriented in the lower-left/ 
upper-right direction. The reason for this orientation is that the estimated correlation 
between $\hat{\beta}_1$ and $\hat{\beta}_2$ is positive, which in turn arises because the correlation between
the regressors STR and Expn is negative (schools that spend more per pupil tend to 
have fewer students per teacher). 


7.5 Model Specification for Multiple Regression


When estimating a causal effect, the job of determining which variables to include in 
multiple regression — that is, the problem of choosing a regression specification—can 
be quite challenging, and no single rule applies in all situations. But do not despair, 
because some useful guidelines are available. The starting point for choosing a regres- 
sion specification is thinking through the possible sources of omitted variable bias. It 
is important to rely on your expert knowledge of the empirical problem and to focus 
on obtaining an unbiased estimate of the causal effect of interest; do not rely primarily
on purely statistical measures of fit such as the $R^2$ or $\bar{R}^2$.


Model Specification and Choosing Control Variables


Multiple regression makes it possible to control for factors that could lead to omitted 
variable bias in the estimate of the effect of interest. But how does one determine the 
“right” set of control variables? 

At a general level, this question is answered by the conditional mean independence
condition of Key Concept 6.5. That is, to eliminate omitted variable bias, a set
of control variables must satisfy $E(u_i \mid X_i, W_i) = E(u_i \mid W_i)$, where $X_i$ denotes the variable
or variables of interest and $W_i$ denotes one or more control variables. This condition
requires that, among observations with the same values of the control variables,
the variable of interest is randomly assigned or as-if randomly assigned in the sense 
that the mean of u no longer depends on X. If this condition fails, then there remain 
omitted determinants of Y that are correlated with X, even after holding W constant, 
and the result is omitted variable bias. 
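
A small simulation can make the role of the control variable concrete. The sketch below assumes a hypothetical data-generating process in which the error term depends on W but, given W, not on X, so conditional mean independence holds by construction. All numbers and variable names are illustrative and do not come from the test score data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical data-generating process (illustrative numbers only).
w = rng.normal(size=n)              # control variable W (e.g., district affluence)
x = -0.8 * w + rng.normal(size=n)   # variable of interest X, correlated with W
u = 2.0 * w + rng.normal(size=n)    # error term: E(u | X, W) = 2W = E(u | W)
TRUE_EFFECT = -1.0
y = 5.0 + TRUE_EFFECT * x + u


def ols(y, *regressors):
    """OLS with an intercept; returns [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


slope_without_w = ols(y, x)[1]     # W omitted: X picks up part of the effect of W
slope_with_w = ols(y, x, w)[1]     # W held constant

print("true effect of X          :", TRUE_EFFECT)
print("estimate omitting W       :", round(float(slope_without_w), 3))
print("estimate controlling for W:", round(float(slope_with_w), 3))
```

In this setup the regression that omits W is far from the true effect, while the regression that includes W recovers it closely. The coefficient on W itself has no causal interpretation here; W only serves to make X as-if randomly assigned.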

In practice, determining which control variables to include requires thinking 
through the application and using judgment. For example, economic conditions could 
vary substantially across school districts with the same percentage of English learn- 
ers. Because the budget of a school district depends in part on the affluence of the 
district, more affluent districts would be expected to have lower class sizes, even 
among districts with the same percentage of English learners. Moreover, more afflu- 
ent families tend to have more access to outside learning opportunities. If so, the 
affluence of the district satisfies the two conditions for omitted variable bias in 
Key Concept 6.1, even after controlling for the percentage of English learners. This 
logic leads to including one or more additional control variables in the test score 
regressions, where the additional control variables measure economic conditions of 
the district. 

Our approach to the challenge of choosing control variables is twofold. First, a 
core or base set of regressors should be chosen using a combination of expert judg- 
ment, economic theory, and knowledge of how the data were collected; the regression 
using this base set of regressors is sometimes referred to as a base specification. This 
base specification should contain the variables of primary interest and the control 
variables suggested by expert judgment and economic theory. Expert judgment and 
economic theory are rarely decisive, however, and often the variables suggested by 
economic theory are not the ones on which you have data. Therefore the next step is 
to develop a list of candidate alternative specifications —that is, alternative sets of 
regressors. If the estimates of the coefficients of interest are numerically similar 
across the alternative specifications, then this provides evidence that the estimates 
from your base specification are reliable. If, on the other hand, the estimates of the 
coefficients of interest change substantially across specifications, this often provides 
evidence that the original specification had omitted variable bias and heightens the 
concern that so might your alternative specifications. We elaborate on this approach 
to model specification in Section 9.2 after studying some additional tools for specifying
regressions.


Interpreting the $R^2$ and the Adjusted $R^2$ in Practice


An $R^2$ or an $\bar{R}^2$ near 1 means that the regressors are good at predicting the values of
the dependent variable in the sample, and an $R^2$ or an $\bar{R}^2$ near 0 means that they are
not. This makes these statistics useful summaries of the predictive ability of the 
regression. However, it is easy to read more into them than they deserve. 


There are four potential pitfalls to guard against when using the $R^2$ or $\bar{R}^2$:

1. An increase in the $R^2$ or $\bar{R}^2$ does not necessarily mean that an added variable
is statistically significant. The $R^2$ increases whenever you add a regressor,
whether or not it is statistically significant. The $\bar{R}^2$ does not always increase,
but if it does, this does not necessarily mean that the coefficient on that added
regressor is statistically significant. To ascertain whether an added variable is
statistically significant, you need to perform a hypothesis test using the t-statistic.

2. A high $R^2$ or $\bar{R}^2$ does not mean that the regressors are a true cause of the dependent
variable. Imagine regressing test scores against parking lot area per pupil.
Parking lot area is correlated with the student-teacher ratio, with whether the
school is in a suburb or a city, and possibly with district income, all things that are
correlated with test scores. Thus the regression of test scores on parking lot area per
pupil could have a high $R^2$ and $\bar{R}^2$, but the relationship is not causal (try telling the
superintendent that the way to increase test scores is to increase parking space!).

3. A high $R^2$ or $\bar{R}^2$ does not mean that there is no omitted variable bias. Recall
the discussion of Section 6.1, which concerned omitted variable bias in the regression
of test scores on the student-teacher ratio. The $R^2$ of the regression was not
mentioned because it played no logical role in this discussion. Omitted variable
bias can occur in regressions with a low $R^2$, a moderate $R^2$, or a high $R^2$. Conversely,
a low $R^2$ does not imply that there necessarily is omitted variable bias.

4. A high $R^2$ or $\bar{R}^2$ does not necessarily mean that you have the most appropriate
set of regressors, nor does a low $R^2$ or $\bar{R}^2$ necessarily mean that you have an
inappropriate set of regressors. The question of what constitutes the right set of
regressors in multiple regression is difficult, and we return to it throughout this
textbook. Decisions about the regressors must weigh issues of omitted variable
bias, data availability, data quality, and, most importantly, economic theory and
the nature of the substantive questions being addressed. None of these questions
can be answered simply by having a high (or low) regression $R^2$ or $\bar{R}^2$.


These points are summarized in Key Concept 7.3. 
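
The first pitfall is easy to check numerically. The sketch below uses simulated data (hypothetical, not the California data) and adds a regressor that is pure noise by construction. It assumes the usual definitions $R^2 = 1 - SSR/TSS$ and $\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\,SSR/TSS$, where k is the number of regressors excluding the intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated data: y depends on x1 only; `noise` is unrelated to y by construction.
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)


def fit_stats(y, *regressors):
    """Return (R-squared, adjusted R-squared) for an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ssr = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    n_obs, k = len(y), X.shape[1] - 1
    r2 = 1.0 - ssr / tss
    adj_r2 = 1.0 - (n_obs - 1) / (n_obs - k - 1) * ssr / tss
    return r2, adj_r2


r2_small, adj_small = fit_stats(y, x1)
r2_big, adj_big = fit_stats(y, x1, noise)

print("without noise regressor: R2 =", round(r2_small, 4), " adjusted R2 =", round(adj_small, 4))
print("with noise regressor   : R2 =", round(r2_big, 4), " adjusted R2 =", round(adj_big, 4))
```

The $R^2$ cannot fall when the noise regressor is added, even though that regressor has nothing to do with y. Whether the $\bar{R}^2$ rises or falls depends on the sample, which is why neither statistic substitutes for a t-test on the added coefficient.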


7.6 Analysis of the Test Score Data Set 


This section presents an analysis of the effect on test scores of the student-teacher 
ratio using the California data set. This analysis illustrates how multiple regression 
analysis can be used to mitigate omitted variable bias. It also shows how to use a table 
to summarize regression results.


Key Concept 7.3
$R^2$ and $\bar{R}^2$: What They Tell You, and What They Don’t

The $R^2$ and $\bar{R}^2$ tell you whether the regressors are good at predicting, or
“explaining,” the values of the dependent variable in the sample of data on hand.
If the $R^2$ (or $\bar{R}^2$) is nearly 1, then the regressors produce good predictions of
the dependent variable in that sample in the sense that the variance of the OLS
residual is small compared to the variance of the dependent variable. If the $R^2$ (or
$\bar{R}^2$) is nearly 0, the opposite is true.

The $R^2$ and $\bar{R}^2$ do NOT tell you whether
1. An included variable is statistically significant,
2. The regressors are a true cause of the dependent variable,
3. There is omitted variable bias, or
4. You have chosen the most appropriate set of regressors.


Discussion of the base and alternative specifications. This analysis focuses on esti- 
mating the effect on test scores of a change in the student-teacher ratio, controlling 
for factors that otherwise could lead to omitted variable bias. Many factors poten- 
tially affect the average test score in a district. Some of these factors are correlated 
with the student-teacher ratio, so omitting them from the regression results in omit- 
ted variable bias. Because these factors, such as outside learning opportunities, are 
not directly measured, we include control variables that are correlated with these 
omitted factors. If the control variables are adequate in the sense that the conditional 
mean independence assumption holds, then the coefficient on the student-teacher 
ratio is the effect of a change in the student-teacher ratio, holding constant these 
other factors. Said differently, our aim is to include control variables such that, once 
they are held constant, the student-teacher ratio is as-if randomly assigned. 

Here we consider three variables that control for background characteristics of 
the students that could affect test scores: the fraction of students who are still learn- 
ing English, the percentage of students who are eligible to receive a subsidized or 
free lunch at school, and a new variable, the percentage of students in the district 
whose families qualify for a California income assistance program. Eligibility for this 
income assistance program depends in part on family income, with a higher (stricter) 
threshold than the subsidized lunch program. The final two variables thus are differ- 
ent measures of the fraction of economically disadvantaged children in the district 
(their correlation coefficient is 0.74). Theory and expert judgment do not tell us 
which of these two variables to use to control for determinants of test scores related 
to economic background. For our base specification, we use the percentage eligible
for a subsidized lunch, but we also consider an alternative specification that uses the
fraction eligible for the income assistance program.

Scatterplots of test scores and these variables are presented in Figure 7.2. Each
of these variables exhibits a negative correlation with test scores. The correlation
between test scores and the percentage of English learners is -0.64, between test
scores and the percentage eligible for a subsidized lunch is -0.87, and between test
scores and the percentage qualifying for income assistance is -0.63.


What scale should we use for the regressors? A practical question that arises in 
regression analysis is what scale you should use for the regressors. In Figure 7.2, the 
units of the variables are percentages, so the maximum possible range of the data is 
0 to 100. Alternatively, we could have defined these variables to be a decimal fraction
rather than a percentage; for example, PctEL could be replaced by the fraction of
English learners, FracEL (= PctEL/100), which would range between 0 and 1
instead of between 0 and 100. More generally, in regression analysis some decision 
usually needs to be made about the scale of both the dependent and the independent 
variables. How, then, should you choose the scale, or units, of the variables? 

The general answer to the question of choosing the scale of the variables is to 
make the regression results easy to read and to interpret. In the test score application, 
the natural unit for the dependent variable is the score of the test itself. In the regression
of TestScore on STR and PctEL reported in Equation (7.5), the coefficient on
PctEL is -0.650. If instead the regressor had been FracEL, the regression would
have had an identical $R^2$ and SER; however, the coefficient on FracEL would have
been -65.0. In the specification with PctEL, the coefficient is the predicted change
in test scores for a 1-percentage-point increase in English learners, holding STR constant;
in the specification with FracEL, the coefficient is the predicted change in test
scores for an increase by 1 in the fraction of English learners (that is, for a
100-percentage-point increase), holding STR constant. Although these two specifications
are mathematically equivalent, for the purposes of interpretation the one
with PctEL seems, to us, more natural. 
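
The equivalence of the two scalings can be verified directly. The following sketch uses simulated data (hypothetical; the coefficients only loosely mimic Equation (7.5)) and fits the same regression twice, once with the regressor in percentage points and once as a fraction. The variable names `str_ratio`, `pct_el`, and `frac_el` are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 420

# Simulated stand-ins for the district data (NOT the California data set).
str_ratio = rng.normal(20.0, 2.0, n)
pct_el = rng.uniform(0.0, 100.0, n)
test_score = 686.0 - 1.10 * str_ratio - 0.650 * pct_el + rng.normal(0.0, 14.0, n)
frac_el = pct_el / 100.0  # same information, measured as a fraction


def fit(y, *regressors):
    """Return OLS coefficients, R-squared, and the standard error of the regression."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ssr = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    k = X.shape[1] - 1
    ser = np.sqrt(ssr / (len(y) - k - 1))
    return coef, 1.0 - ssr / tss, ser


coef_pct, r2_pct, ser_pct = fit(test_score, str_ratio, pct_el)
coef_frac, r2_frac, ser_frac = fit(test_score, str_ratio, frac_el)

print("coefficient on PctEL :", coef_pct[2])
print("coefficient on FracEL:", coef_frac[2])
print("R2  (pct, frac):", r2_pct, r2_frac)
print("SER (pct, frac):", ser_pct, ser_frac)
```

Dividing the regressor by 100 multiplies its coefficient by 100 and leaves the fit statistics and the other slope coefficient unchanged, so the choice of scale is purely a matter of readability.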

Another consideration when deciding on a scale is to choose the units of the 
regressors so that the resulting regression coefficients are easy to read. For example, 
if a regressor is measured in dollars and has a coefficient of 0.00000356, it is easier 
to read if the regressor is converted to millions of dollars and the coefficient 3.56 is 
reported. 


Tabular presentation of results. We are now faced with a communication problem.
What is the best way to show the results from several multiple regressions that con- 
tain different subsets of the possible regressors? So far, we have presented regression 
results by writing out the estimated regression equations, as in Equations (7.6) and 
(7.19). This works well when there are only a few regressors and only a few equations, 
but with more regressors and equations, this method of presentation can be confus- 
ing. A better way to communicate the results of several regressions is in a table. 

Table 7.1 summarizes the results of regressions of the test score on various sets 
of regressors. Each column presents a separate regression. Each regression has the 
same dependent variable, test score. The first row reports statistics that provide infor- 
mation about the causal effect of interest, the effect of the student-teacher ratio on 
test scores. The first entry is the OLS estimate, below which is its standard error (in 
parentheses). Below the standard error in brackets is a 95% confidence interval for 
the population coefficient. Although a reader could take out his or her calculator and 
compute the confidence interval from the estimate and its standard error, doing so is 
inconvenient, so the table provides this information for the reader. A reader inter- 
ested in testing the null hypothesis that the coefficient takes on some particular 
value, for example 0, at the 5% significance level can do so by checking whether that 
value is included in the 95% confidence interval.

TABLE 7.1  Results of Regressions of Test Scores on the Student-Teacher Ratio and Student
Characteristic Control Variables Using California Elementary School Districts

Dependent variable: average test score in the district.

Regressor                                         (1)               (2)               (3)               (4)               (5)
Student-teacher ratio (X1)                        -2.28             -1.10             -1.00             -1.31             -1.01
                                                  (0.52)            (0.43)            (0.27)            (0.34)            (0.27)
                                                  [-3.30, -1.26]    [-1.95, -0.25]    [-1.53, -0.47]    [-1.97, -0.64]    [-1.54, -0.49]
Control variables
Percentage English learners (X2)                                    -0.650            -0.122            -0.488            -0.130
                                                                    (0.031)           (0.033)           (0.030)           (0.036)
Percentage eligible for subsidized lunch (X3)                                         -0.547                              -0.529
                                                                                      (0.024)                             (0.038)
Percentage qualifying for income assistance (X4)                                                        -0.790            0.048
                                                                                                        (0.068)           (0.059)
Intercept                                         698.9             686.0             700.2             698.0             700.4
                                                  (10.4)            (8.7)             (5.6)             (6.9)             (5.5)
Summary statistics
SER                                               18.58             14.46             9.08              11.65             9.08
$\bar{R}^2$                                       0.049             0.424             0.773             0.626             0.773
n                                                 420               420               420               420               420

These regressions were estimated using the data on K-8 school districts in California, described in Appendix 4.1.
Heteroskedasticity-robust standard errors are given in parentheses under coefficients. For the variable of interest,
the student-teacher ratio, the 95% confidence interval is given in brackets below the standard error.

The remaining variables are control variables and the constant term (intercept);
for these, only the OLS estimate and its standard error are reported. Because the 
coefficients on the control variables do not, in general, have a causal interpretation, 
these coefficient estimates are often of limited independent interest, so no confi- 
dence interval is reported, although a reader who wants a confidence interval for one 
of those coefficients can compute it using the information provided. In cases in which 
there are many control variables, as there are in regressions later in this text, some- 
times a table will report no information at all about their coefficients or standard 
errors and will simply list the included control variables. Similarly, the value of the 
intercept often is of limited interest, so it, too, might not be reported. 

The final three rows contain summary statistics for the regression (the standard 
error of the regression, SER, and the $\bar{R}^2$) and the sample size (which is the same for
all of the regressions, 420 observations). 

All the information that we have presented so far in equation format appears in 
this table. For example, consider the regression of the test score against the student- 
teacher ratio, with no control variables. In equation form, this regression is 


$$\widehat{TestScore} = \underset{(10.4)}{698.9} - \underset{(0.52)}{2.28} \times STR, \quad \bar{R}^2 = 0.049,\ SER = 18.58,\ n = 420. \tag{7.21}$$

All this information appears in column (1) of Table 7.1. The estimated coefficient on
the student-teacher ratio (-2.28) appears in the first row of numerical entries, and
its standard error (0.52) appears in parentheses just below the estimated coefficient. 
The table augments the information in Equation (7.21) by reporting the 95% confi- 
dence interval. The intercept (698.9) and its standard error (10.4) are given in the row 
labeled “Intercept.” (Sometimes you will see this row labeled “Constant” because, as 
discussed in Section 6.2, the intercept can be viewed as the coefficient on a regressor 
that is always equal to 1.) Similarly, the $\bar{R}^2$ (0.049), the SER (18.58), and the sample
size n (420) appear in the final rows. The blank entries in the rows of the other regres- 
sors indicate that those regressors are not included in this regression. 

Although the table does not report t-statistics, they can be computed from the
information provided; for example, the t-statistic testing the hypothesis that the coefficient
on the student-teacher ratio in column (1) is 0 is -2.28/0.52 = -4.38. This
hypothesis is rejected at the 1% level.
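
The arithmetic behind these table entries is short enough to script. The sketch below recomputes the t-statistic and the 95% confidence interval for column (1) from the reported estimate and standard error; the function names are ours, and the critical values 1.96 and 2.58 are the usual large-sample normal values.

```python
def t_statistic(estimate, std_error, null_value=0.0):
    """t-statistic for the null hypothesis that the coefficient equals null_value."""
    return (estimate - null_value) / std_error


def confidence_interval(estimate, std_error, critical_value=1.96):
    """Large-sample confidence interval: estimate +/- critical_value * SE."""
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width


# Column (1) of Table 7.1: coefficient on the student-teacher ratio.
estimate, std_error = -2.28, 0.52

t = t_statistic(estimate, std_error)
low, high = confidence_interval(estimate, std_error)

print(f"t = {t:.2f}")                          # t = -4.38
print(f"95% CI = [{low:.2f}, {high:.2f}]")     # 95% CI = [-3.30, -1.26]
print("reject at the 1% level:", abs(t) > 2.58)  # True
```

For the other columns, recomputing the interval from the rounded table entries can differ from the bracketed interval in the last digit (column (2) gives roughly [-1.94, -0.26] rather than [-1.95, -0.25]), presumably because the table was built from unrounded estimates.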

Regressions that include the control variables measuring student characteristics 
are reported in columns (2) through (5). Column (2), which reports the regression of 
test scores on the student-teacher ratio and on the percentage of English learners, 
was previously stated as Equation (7.5). 

Column (3) presents the base specification, in which the regressors are the 
student-teacher ratio and two control variables, the percentage of English learners 
and the percentage of students eligible for a subsidized lunch. 

Columns (4) and (5) present alternative specifications that examine the effect 
of changes in the way the economic background of the students is measured. In 
column (4), the percentage of students qualifying for income assistance is included 
as a regressor, and in column (5), both of the economic background variables are 
included. 
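
A table of this kind can be produced programmatically by looping over the specifications. The sketch below is a minimal illustration on simulated data: the variables are hypothetical stand-ins, not the California data set, and the robust standard errors use an HC1-type degrees-of-freedom adjustment, which is our assumption about the variant intended.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 420

# Simulated stand-ins for the district variables (hypothetical data).
disadvantage = rng.uniform(0.0, 1.0, n)
pct_lunch = 100 * np.clip(disadvantage + rng.normal(0, 0.10, n), 0, 1)
pct_assist = 100 * np.clip(0.4 * disadvantage + rng.normal(0, 0.08, n), 0, 1)
pct_el = 100 * np.clip(0.5 * disadvantage + rng.normal(0, 0.15, n), 0, 1)
str_ratio = 19 + 3 * disadvantage + rng.normal(0, 1.5, n)
score = 700 - 1.0 * str_ratio - 0.12 * pct_el - 0.55 * pct_lunch + rng.normal(0, 9, n)


def ols_robust(y, regressors):
    """OLS with an intercept and heteroskedasticity-robust (HC1-type) standard errors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    n_obs, n_par = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    meat = (X * resid[:, None] ** 2).T @ X      # sum of u_i^2 * x_i x_i'
    cov = xtx_inv @ meat @ xtx_inv * n_obs / (n_obs - n_par)
    return coef, np.sqrt(np.diag(cov))


# Each specification mirrors the regressor set of one column of Table 7.1.
specs = {
    "(1)": [str_ratio],
    "(2)": [str_ratio, pct_el],
    "(3)": [str_ratio, pct_el, pct_lunch],              # base specification
    "(4)": [str_ratio, pct_el, pct_assist],
    "(5)": [str_ratio, pct_el, pct_lunch, pct_assist],
}

rows = {}
for name, regs in specs.items():
    coef, se = ols_robust(score, regs)
    b, s = float(coef[1]), float(se[1])         # coefficient of interest: STR
    rows[name] = (b, s, b - 1.96 * s, b + 1.96 * s)
    print(f"{name}: STR {b:6.2f} ({s:.2f}) [{b - 1.96 * s:.2f}, {b + 1.96 * s:.2f}]")
```

Reporting only the coefficient of interest, its standard error, and its confidence interval for each column keeps the comparison across specifications easy to read, which is the purpose of the tabular format.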


Discussion of empirical results. These results suggest three conclusions: 


1. Controlling for these student characteristics cuts the estimated effect of the 
student-teacher ratio on test scores approximately in half. This estimated 
effect is not very sensitive to which specific control variables are included in 
the regression. In all cases, the hypothesis that the coefficient on the student- 
teacher ratio is 0 can be rejected at the 5% level. In the four specifications with 
control variables, regressions (2) through (5), reducing the student-teacher 
ratio by one student per teacher is estimated to increase average test scores by 
approximately 1 point, holding constant student characteristics. 

2. The student characteristic variables are potent predictors of test scores. The 
student-teacher ratio alone explains only a small fraction of the variation in
test scores: The $\bar{R}^2$ in column (1) is 0.049. The $\bar{R}^2$ jumps, however, when the
student characteristic variables are added. For example, the $\bar{R}^2$ in the base
specification, regression (3), is 0.773. The signs of the coefficients on the student
demographic variables are consistent with the patterns seen in Figure 7.2:
Districts with many English learners and districts with many poor children
have lower test scores.

3. In contrast to the other two control variables, the percentage qualifying for
income assistance appears to be redundant. As reported in regression (5), 
adding it to regression (3) has a negligible effect on the estimated coefficient 
on the student-teacher ratio or its standard error. 


Conclusion 


Chapter 6 began with a concern: In the regression of test scores against the student- 
teacher ratio, omitted student characteristics that influence test scores might be 
correlated with the student-teacher ratio in the district, and, if so, the student-teacher 
ratio in the district would pick up the effect on test scores of these omitted student 
characteristics. Thus the OLS estimator would have omitted variable bias. To mitigate 
this potential omitted variable bias, we augmented the regression by including variables 
that control for various student characteristics (the percentage of English learners and 
two measures of student economic background). Doing so cuts the estimated effect of 
a unit change in the student-teacher ratio in half, although it remains possible to reject 
the null hypothesis that the population effect on test scores, holding these control 
variables constant, is 0 at the 5% significance level. Because they eliminate omitted 
variable bias arising from these student characteristics, these multiple regression 
estimates, hypothesis tests, and confidence intervals are much more useful for advising 
the superintendent than are the single-regressor estimates of Chapters 4 and 5. 

The analysis in this and the preceding chapter has presumed that the population 
regression function is linear in the regressors (that is, that the conditional expectation
of $Y_i$ given the regressors is a straight line). There is, however, no particular reason
to think this is so. In fact, the effect of reducing the student-teacher ratio might be 
quite different in districts with large classes than in districts that already have small 
classes. If so, the population regression line is not linear in the X’s but rather is a 
nonlinear function of the X’s. To extend our analysis to regression functions that are 
nonlinear in the X’s, however, we need the tools developed in the next chapter. 


Summary 


1. Hypothesis tests and confidence intervals for a single regression coefficient 
are carried out using essentially the same procedures used in the one-variable 
linear regression model of Chapter 5. For example, a 95% confidence interval 
for $\beta_1$ is given by $\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1)$.

2. Hypotheses involving more than one restriction on the coefficients are called 
joint hypotheses. Joint hypotheses can be tested using an F-statistic. 

3. Regression specification proceeds by first determining a base specification cho- 
sen to address concern about omitted variable bias. The base specification can be 
modified by including additional regressors that control for other potential sources 
of omitted variable bias. Simply choosing the specification with the highest $R^2$ can
lead to regression models that do not estimate the causal effect of interest.


Key Terms

restrictions
joint hypothesis
F-statistic
restricted regression
unrestricted regression
homoskedasticity-only F-statistic
95% confidence set
base specification
alternative specifications
Bonferroni test


7.1 What is a joint hypothesis? Explain how an F-statistic is constructed to
test a joint hypothesis. What is the hypothesis that is tested by constructing
the overall regression F-statistic in the multiple regression model
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$? Explain using the concepts of restricted and
unrestricted regressions. Why is it important for a researcher to have information
on the distribution of the error terms when implementing these tests?

7.2 Describe the recommended approach towards determining model specification.
How does the $R^2$ help in determining an appropriate model? Is the ideal
model the one with the highest $R^2$? Should a regressor be included in the
model if it increases the model $R^2$?

7.3 What is a control variable, and how does it differ from a variable of interest?
Looking at Table 7.1, for what factors are the control variables controlling?
Do coefficients on control variables measure causal effects? Explain.


Exercises 


The first six exercises refer to the table of estimated regressions on page 270, computed
using data on employees in a developing country. The data set consists of
information on over 10,000 full-time, full-year workers. The highest educational
achievement for each worker is either a high school diploma or a bachelor’s degree.
The workers’ ages range from 25 to 40 years. The data set also contains information
on the region of the country where the person lives, gender, and age. For the purposes
of these exercises, let

AWE = logarithm of average weekly earnings (in 2007 units)
High School = binary variable (1 if high school, 0 if less)
Male = binary variable (1 if male, 0 if female)
Age = age (in years)
North = binary variable (1 if Region = North, 0 otherwise)
East = binary variable (1 if Region = East, 0 otherwise)
South = binary variable (1 if Region = South, 0 otherwise)
West = binary variable (1 if Region = West, 0 otherwise)


Results of Regressions of Average Weekly Earnings on Gender and Education Binary Variables and
Other Characteristics Using 2007 Data from a Developing Country Survey

Dependent variable: log average weekly earnings (AWE).

Regressor                               (1)         (2)         (3)
High school graduate (X1)               0.352       0.373       0.371
                                        (0.021)     (0.021)     (0.021)
Male (X2)                               0.458       0.457       0.451
                                        (0.021)     (0.020)     (0.020)
Age (X3)                                            0.011       0.011
                                                    (0.001)     (0.001)
North (X4)                                                      0.175
                                                                (0.037)
South (X5)                                                      0.103
                                                                (0.033)
East (X6)                                                       -0.102
                                                                (0.043)
Intercept                               12.84       12.471      12.390
                                        (0.018)     (0.049)     (0.057)
Summary Statistics and Joint Tests
F-statistic for regional effects = 0                            21.87
SER                                     1.026       1.023       1.020
$\bar{R}^2$                             0.0710      0.0761      0.0814
n                                       10973       10973       10973


7.1 For each of the three regressions, add * (5% level) and ** (1% level) to the
table to indicate the statistical significance of the coefficients.

7.2 Using the regression results in column (1):

a. Is the high school earnings difference estimated from this regression statistically
significant at the 5% level? Construct a 95% confidence interval of the difference.

b. Is the male-female earnings difference estimated from this regression
statistically significant at the 5% level? Construct a 95% confidence
interval for the difference.

7.3 Using the regression results in column (2):

a. Is age an important determinant of earnings? Use an appropriate statistical
test and/or confidence interval to explain your answer.

b. Suppose Alvo is a 30-year-old male college graduate, and Kal is a
40-year-old male college graduate. Construct a 95% confidence interval
for the expected difference between their earnings.

7.4 Using the regression results in column (3):

a. Are there any important regional differences? Use an appropriate
hypothesis test to explain your answer.

b. Juan is a 32-year-old male high school graduate from the North. Mel is
a 32-year-old male college graduate from the West. Ari is a 32-year-old
male college graduate from the East.

i. Construct a 95% confidence interval for the difference in expected
earnings between Juan and Mel.

ii. Explain how you would construct a 95% confidence interval for the
difference in expected earnings between Juan and Ari. (Hint: What
would happen if you included West and excluded East from the
regression?)

7.5 The regression shown in column (2) was estimated again, this time using data
from 1993 (5000 observations selected at random and converted into 2007
units using the Consumer Price Index). The results are

$$\widehat{logAWE} = \underset{(0.20)}{9.32} + \underset{(0.019)}{0.301}\,High\ school + \underset{(0.047)}{0.562}\,Male + \underset{(0.002)}{0.011}\,Age, \quad SER = 1.25,\ \bar{R}^2 = 0.08$$

Comparing this regression to the regression for 2007 shown in column (2),
was there a statistically significant change in the coefficient on High school?

7.6 In all of the regressions in the previous Exercises, the coefficient of High
school is positive, large, and statistically significant. Do you believe this provides
strong statistical evidence of the high returns to schooling in the labor
market?

7.7 Question 6.5 reported the following regression (where standard errors have
been added):

$$\widehat{Price} = \underset{(22.1)}{109.7} + \underset{(1.23)}{0.567}\,BDR + \underset{(9.76)}{26.9}\,Bath + \underset{(0.021)}{0.239}\,Hsize + \underset{(0.00072)}{0.005}\,Lsize + \underset{(0.23)}{0.1}\,Age - \underset{(12.23)}{56.9}\,Poor, \quad \bar{R}^2 = 0.85,\ SER = 45.8.$$

a. Is the coefficient on BDR statistically significantly different from zero?

b. Typically, four-bedroom houses sell for more than three-bedroom houses.
Is this consistent with your answer to (a) and with the regression more
generally?

c. A homeowner purchases 2500 square feet from an adjacent lot. Construct
a 95% confidence interval for the change in the value of her house.

d. Lot size is measured in square feet. Do you think that another scale
might be more appropriate? Why or why not?

e. The F-statistic for omitting BDR and Age from the regression is
F = 2.38. Are the coefficients on BDR and Age statistically different
from zero at the 10% level?

7.8 Referring to the Table on page 266 used for Exercises 7.1 to 7.6:

a. Construct the $R^2$ for each of the regressions.

b. Show how to construct the homoskedasticity-only F-statistic for testing
$\beta_4 = \beta_5 = \beta_6 = 0$ in the regression shown in column (3). Is the statistic
significant at the 1% level?

c. Test $\beta_4 = \beta_5 = \beta_6 = 0$ in the regression shown in column (3) using the
Bonferroni test discussed in Appendix 7.1.

d. Construct a 99% confidence interval for $\beta_1$ for the regression in column (3).

7.9 Consider the regression model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$. Use approach 2
from Section 7.3 to transform the regression so that you can use a t-statistic to test

a. $\beta_1 = \beta_2$.

b. $\beta_1 + 2\beta_2 = 0$.

c. $\beta_1 + \beta_2 = 1$. (Hint: You must redefine the dependent variable in the
regression.)

7.10 Equations (7.13) and (7.14) show two formulas for the homoskedasticity-only
F-statistic. Show that the two formulas are equivalent.


Empirical Exercises 


E7.1 Use the Birthweight_Smoking data set introduced in Empirical Exercise E5.3
to answer the following questions. To begin, run three regressions:
(1) Birthweight on Smoker
(2) Birthweight on Smoker, Alcohol, and Nprevist
(3) Birthweight on Smoker, Alcohol, Nprevist, and Unmarried


a. What is the value of the estimated effect of smoking on birth weight in 
each of the regressions? 


b. Construct a 95% confidence interval for the effect of smoking on birth 
weight, using each of the regressions. 


c. Does the coefficient on Smoker in regression (1) suffer from omitted 
variable bias? Explain. 


d. Does the coefficient on Smoker in regression (2) suffer from omitted 
variable bias? Explain. 


e. Consider the coefficient on Unmarried in regression (3). 
i. Construct a 95% confidence interval for the coefficient. 
ii. Is the coefficient statistically significant? Explain. 
iii. Is the magnitude of the coefficient large? Explain. 


iv. A family advocacy group notes that the large coefficient suggests 
that public policies that encourage marriage will lead, on average, to 
healthier babies. Do you agree? (Hint: Review the discussion of con- 
trol variables in Section 6.8. Discuss some of the various factors that 
Unmarried may be controlling for and how this affects the interpreta- 
tion of its coefficient.) 


f. Consider the various other control variables in the data set. Which do you 
think should be included in the regression? Using a table like Table 7.1, exam- 
ine the robustness of the confidence interval you constructed in (b). What is a 
reasonable 95% confidence interval for the effect of smoking on birth weight? 


E7.2 In the empirical exercises on earnings and height in Chapters 4 and 5, you
estimated a relatively large and statistically significant effect of a worker’s 
height on his or her earnings. One explanation for this result is omitted vari- 
able bias: Height is correlated with an omitted factor that affects earnings. 
For example, Case and Paxson (2008) suggest that cognitive ability (or intel- 
ligence) is the omitted factor. The mechanism they describe is straightforward: 
Poor nutrition and other harmful environmental factors in utero and in early 
childhood have, on average, deleterious effects on both cognitive and physi- 
cal development. Cognitive ability affects earnings later in life and thus is an 
omitted variable in the regression. 


a. Suppose that the mechanism described above is correct. Explain how 
this leads to omitted variable bias in the OLS regression of Earnings 
on Height. Does the bias lead the estimated slope to be too large or too 
small? [Hint: Review Equation (6.1).] 
If the mechanism described above is correct, the estimated effect of height
on earnings should disappear if a variable measuring cognitive ability is
included in the regression. Unfortunately, there isn’t a direct measure of cognitive
ability in the data set, but the data set does include years of education for
each individual. Because students with higher cognitive ability are more likely
to attend school longer, years of education might serve as a control variable for
cognitive ability; in this case, including education in the regression will eliminate,
or at least attenuate, the omitted variable bias problem.

Use the years of education variable (educ) to construct four indicator
variables for whether a worker has less than a high school diploma
(LT_HS = 1 if educ < 12, 0 otherwise), a high school diploma (HS = 1 if
educ = 12, 0 otherwise), some college (Some_Col = 1 if 12 < educ < 16, 0
otherwise), or a bachelor’s degree or higher (College = 1 if educ ≥ 16, 0
otherwise).
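
The indicator definitions above can be sketched in Python. This is only an illustration, not part of the exercise's data files: the function name `education_indicators` is hypothetical, and the last category is assumed to cover 16 or more years of education, matching "a bachelor's degree or higher."

```python
def education_indicators(educ):
    """Return the four education indicators for one worker.

    educ is years of education; the cutoffs follow the definitions in the text.
    """
    return {
        "LT_HS": int(educ < 12),          # less than a high school diploma
        "HS": int(educ == 12),            # high school diploma
        "Some_Col": int(12 < educ < 16),  # some college
        "College": int(educ >= 16),       # bachelor's degree or higher (assumed cutoff)
    }


# Illustrative values of educ, not taken from the data set.
for years in (10, 12, 14, 16, 18):
    print(years, education_indicators(years))
```

In a statistical package the same step is usually a vectorized comparison on the educ column; the row-by-row function here is meant only to make the cutoffs explicit.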


b. Focusing first on women only, run a regression of (1) Earnings on Height 
and (2) Earnings on Height, including LT_HS, HS, and Some_Col as 
control variables. 

i. Compare the estimated coefficient on Height in regressions (1) and 
(2). Is there a large change in the coefficient? Has it changed in a way 
consistent with the cognitive ability explanation? Explain. 


ii. The regression omits the control variable College. Why? 


iii. Test the joint null hypothesis that the coefficients on the education 
variables are equal to 0. 

iv. Discuss the values of the estimated coefficients on LT_HS, HS, and 
Some_Col. (Each of the estimated coefficients is negative, and the 
coefficient on LT_HS is more negative than the coefficient on HS, 
which in turn is more negative than the coefficient on Some_Col. 
Why? What do the coefficients measure?) 


c. Repeat (b), using data for men.


APPENDIX 


7.1 The Bonferroni Test of a Joint Hypothesis 


The method of Section 7.2 is the preferred way to test joint hypotheses in multiple regression.
However, if the author of a study presents regression results but did not test a joint restriction
in which you are interested and if you do not have the original data, then you will not be able
to compute the F-statistic as in Section 7.2. This appendix describes a way to test joint
hypotheses that can be used when you have only a table of regression results. This method is an
application of a very general testing approach based on Bonferroni’s inequality.
The Bonferroni test is a test of a joint hypothesis based on the t-statistics for the
individual hypotheses; that is, the Bonferroni test is the one-at-a-time t-statistic test of Section 7.2
done properly. The Bonferroni test of the joint null hypothesis \( \beta_1 = \beta_{1,0} \) and \( \beta_2 = \beta_{2,0} \),
based on the critical value \( c > 0 \), uses the following rule:

Accept if \( |t_1| \le c \) and if \( |t_2| \le c \); otherwise, reject (7.22)
(Bonferroni one-at-a-time t-statistic test)

where \( t_1 \) and \( t_2 \) are the t-statistics that test the restrictions on \( \beta_1 \) and \( \beta_2 \), respectively.

The trick is to choose the critical value c in such a way that the probability that the
one-at-a-time test rejects when the null hypothesis is true is no more than the desired significance
level (say, 5%). This is done by using Bonferroni’s inequality to choose the critical value c to
allow both for the fact that two restrictions are being tested and for any possible correlation
between \( t_1 \) and \( t_2 \).


Bonferroni's Inequality 


Bonferroni’s inequality is a basic result of probability theory. Let A and B be events. Let
\( A \cap B \) be the event “both A and B” (the intersection of A and B), and let \( A \cup B \) be the
event “A or B or both” (the union of A and B). Then \( \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B) \).
Because \( \Pr(A \cap B) \ge 0 \), it follows that \( \Pr(A \cup B) \le \Pr(A) + \Pr(B) \).¹ Now let
A be the event that \( |t_1| > c \) and B be the event that \( |t_2| > c \). Then the inequality
\( \Pr(A \cup B) \le \Pr(A) + \Pr(B) \) yields

\( \Pr(|t_1| > c \text{ or } |t_2| > c \text{ or both}) \le \Pr(|t_1| > c) + \Pr(|t_2| > c). \) (7.23)


Bonferroni Tests 


Because the event “\( |t_1| > c \) or \( |t_2| > c \) or both” is the rejection region of the one-at-a-time test,
Equation (7.23) leads to a valid critical value for the one-at-a-time test. Under the null
hypothesis in large samples, \( \Pr(|t_1| > c) = \Pr(|t_2| > c) = \Pr(|Z| > c) \). Thus Equation (7.23)
implies that in large samples the probability that the one-at-a-time test rejects under the null is

\( \Pr_{H_0}(\text{one-at-a-time test rejects}) \le 2\Pr(|Z| > c). \) (7.24)

The inequality in Equation (7.24) provides a way to choose a critical value c so that the
probability of rejection under the null hypothesis does not exceed the desired significance level. The
Bonferroni approach can be extended to more than two coefficients; if there are q restrictions
under the null, the factor of 2 on the right-hand side in Equation (7.24) is replaced by q.
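
The bound in Equation (7.24) can be checked numerically. The sketch below is a simulation under stated assumptions, not a result from the text: it assumes that under the null the two t-statistics are jointly standard normal with a correlation `rho` (a hypothetical parameter chosen here for illustration), and it compares the simulated rejection rate of the one-at-a-time rule with 2 Pr(|Z| > c).

```python
import random
from statistics import NormalDist


def one_at_a_time_rejection_rate(c, rho, draws=50_000, seed=0):
    """Share of simulated draws in which |t1| > c or |t2| > c.

    t1 and t2 are drawn as standard normal variables with correlation rho,
    an assumed stand-in for their large-sample distribution under the null.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(draws):
        t1 = rng.gauss(0.0, 1.0)
        t2 = rho * t1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        if abs(t1) > c or abs(t2) > c:
            rejections += 1
    return rejections / draws


c = 2.241
pr_abs_z_exceeds_c = 2 * (1 - NormalDist().cdf(c))  # Pr(|Z| > c)
bound = 2 * pr_abs_z_exceeds_c                      # right-hand side of (7.24)

for rho in (0.0, 0.5, 0.9):
    rate = one_at_a_time_rejection_rate(c, rho)
    print(f"rho = {rho:.1f}: simulated rejection rate = {rate:.4f}, bound = {bound:.4f}")
```

Up to simulation noise, the rejection rate should stay at or below the bound for every correlation, which is the sense in which the Bonferroni critical value allows for any possible correlation between the two t-statistics.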


¹ This inequality can be used to derive other interesting inequalities. For example, it implies that
\( 1 - \Pr(A \cup B) \ge 1 - [\Pr(A) + \Pr(B)] \). Let \( A^c \) and \( B^c \) be the complements of A and B, that is, the events
“not A” and “not B.” Because the complement of \( A \cup B \) is \( A^c \cap B^c \), \( 1 - \Pr(A \cup B) = \Pr(A^c \cap B^c) \),
which yields Bonferroni’s inequality, \( \Pr(A^c \cap B^c) \ge 1 - [\Pr(A) + \Pr(B)] \).
Table 7.2 presents critical values c for the one-at-a-time Bonferroni test for various
significance levels and q = 2, 3, and 4. For example, suppose the desired significance level is 5%
and q = 2. According to Table 7.2, the critical value c is 2.241. This critical value is the 98.75th
percentile of the standard normal distribution (it leaves 1.25% in each tail), so
\( \Pr(|Z| > 2.241) = 2.5\% \). Thus Equation (7.24) tells us that in large samples the one-at-a-time
test in Equation (7.22) will reject at most 5% of the time under the null hypothesis.


TABLE 7.2  Bonferroni Critical Values c for the One-at-a-Time t-Statistic Test
of a Joint Hypothesis

                                  Significance Level
Number of Restrictions (q)      10%       5%        1%
2                               1.960     2.241     2.807
3                               2.128     2.394     2.935
4                               2.241     2.498     3.023


The critical values in Table 7.2 are larger than the critical values for testing a single restric- 
tion. For example, with q = 2, the one-at-a-time test rejects if at least one t-statistic exceeds 
2.241 in absolute value. This critical value is greater than 1.96 because it properly corrects for 
the fact that, by looking at two t-statistics, you get a second chance to reject the joint null 
hypothesis, as discussed in Section 7.2. 
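
The entries of Table 7.2 can be reproduced from the standard normal distribution. The following is a minimal sketch: the function name `bonferroni_critical_value` is hypothetical, and it solves q Pr(|Z| > c) = α for c, which is Equation (7.24) with the factor 2 replaced by q and the bound set equal to the significance level α.

```python
from statistics import NormalDist


def bonferroni_critical_value(alpha, q):
    """Critical value c such that q * Pr(|Z| > c) = alpha for standard normal Z."""
    # Pr(|Z| > c) = alpha / q means each tail holds alpha / (2 q).
    return NormalDist().inv_cdf(1 - alpha / (2 * q))


print("q     10%     5%      1%")
for q in (2, 3, 4):
    row = [bonferroni_critical_value(alpha, q) for alpha in (0.10, 0.05, 0.01)]
    print(q, "  ", "   ".join(f"{c:.3f}" for c in row))
```

The same function extends the table to any number of restrictions q or any significance level not listed.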

If the individual t-statistics are based on heteroskedasticity-robust standard errors, then
the Bonferroni test is valid whether or not there is heteroskedasticity, but if the t-statistics are
based on homoskedasticity-only standard errors, the Bonferroni test is valid only under
homoskedasticity.


Application to Test Scores 


The t-statistics testing the joint null hypothesis that the true coefficients on the student-teacher
ratio and expenditures per pupil in Equation (7.6) are 0 are, respectively, \( t_1 = -0.60 \) and
\( t_2 = 2.43 \). Although \( |t_1| < 2.241 \), because \( |t_2| > 2.241 \) we can reject the joint null hypothesis
at the 5% significance level using the Bonferroni test. However, both \( t_1 \) and \( t_2 \) are less than 2.807
in absolute value, so we cannot reject the joint null hypothesis at the 1% significance level using the
Bonferroni test. In contrast, using the F-statistic in Section 7.2, we were able to reject this
hypothesis at the 1% significance level.
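
The decision rule in Equation (7.22) applied to these two t-statistics can be written out as a short sketch. The function name `bonferroni_rejects` is hypothetical; the critical values are copied from the q = 2 row of Table 7.2 rather than recomputed.

```python
def bonferroni_rejects(t_stats, c):
    """One-at-a-time rule: reject the joint null if any |t| exceeds c."""
    return any(abs(t) > c for t in t_stats)


# t-statistics reported in the text for the two coefficients in Equation (7.6).
t_stats = [-0.60, 2.43]

# Critical values for q = 2 restrictions, taken from Table 7.2.
critical_values = {"5%": 2.241, "1%": 2.807}

for level, c in critical_values.items():
    decision = "reject" if bonferroni_rejects(t_stats, c) else "do not reject"
    print(f"{level} level (c = {c}): {decision} the joint null hypothesis")
```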


CHAPTER 8  Nonlinear Regression Functions


In Chapters 4 through 7, the population regression function was assumed to be linear;
that is, it has a constant slope. In the context of causal inference, this constant slope
corresponds to the effect on Y of a unit change in X being the same for all values of the
regressors. But what if the effect on Y of a change in X in fact depends on the value of
one or more of the regressors? If so, the population regression function is nonlinear.

This chapter develops two groups of methods for detecting and modeling nonlinear
population regression functions. The methods in the first group are useful when the
relationship between Y and an independent variable, \( X_1 \), depends on the value of \( X_1 \) itself.
For example, reducing class sizes by one student per teacher might have a greater effect
if class sizes are already manageably small than if they are so large that the teacher can
do little more than keep the class under control. If so, the test score (Y) is a nonlinear
function of the student-teacher ratio (\( X_1 \)), where this function is steeper when \( X_1 \) is small.
An example of a nonlinear regression function with this feature is shown in Figure 8.1.
Whereas the linear population regression function in Figure 8.1(a) has a constant slope,
the nonlinear population regression function in Figure 8.1(b) has a steeper slope when
\( X_1 \) is small than when it is large. This first group of methods is presented in Section 8.2.

The methods in the second group are useful when the effect on Y of a change
in \( X_1 \) depends on the value of another independent variable (say, \( X_2 \)). For example,
students still learning English might especially benefit from having more one-on-one
attention; if so, the effect on test scores of reducing the student-teacher ratio will be
greater in districts with many students still learning English than in districts with few
English learners. In this example, the effect on test scores (Y) of a reduction in the
student-teacher ratio (\( X_1 \)) depends on the percentage of English learners in the
district (\( X_2 \)). As shown in Figure 8.1(c), the slope of this type of population regression
function depends on the value of \( X_2 \). This second group of methods is presented in Section 8.3.

In the models of Sections 8.2 and 8.3, the population regression function is a nonlinear
function of the independent variables. Although they are nonlinear in the X's, these models
are linear functions of the unknown coefficients (or parameters) of the population regression
model and thus are versions of the multiple regression model of Chapters 6 and 7. Therefore,
the unknown parameters of these nonlinear regression functions can be estimated and
tested using OLS and the methods of Chapters 6 and 7. In some applications, the regression
function is a nonlinear function of the X's and of the parameters. If so, the parameters cannot
be estimated by OLS, but they can be estimated using nonlinear least squares. Appendix 8.1
provides examples of such functions and describes the nonlinear least squares estimator.

Sections 8.1 and 8.2 introduce nonlinear regression functions in the context of
regression with a single independent variable, and Section 8.3 extends this to two
independent variables. To keep things simple, additional regressors are omitted in the
empirical examples of Sections 8.1 through 8.3. In practice, however, if the aim is to
use the nonlinear model to estimate causal effects, it remains important to control
for omitted factors by including control variables as well. In Section 8.4, we combine
nonlinear regression functions and additional control variables when we take a close
look at possible nonlinearities in the relationship between test scores and the
student-teacher ratio, holding student characteristics constant.

FIGURE 8.1  Population Regression Functions with Different Slopes
(a) Constant slope  (b) Slope depends on the value of \( X_1 \)  (c) Slope depends on the value of \( X_2 \)
In Figure 8.1(a), the population regression function has a constant slope. In Figure 8.1(b), the slope of the
population regression function depends on the value of \( X_1 \). In Figure 8.1(c), the slope of the population
regression function depends on the value of \( X_2 \).

The aim of this chapter is to explain the main methods for modeling nonlinear 
regression functions. In Sections 8.1-8.3, we assume that the least squares assumptions 
for causal inference in multiple regression (Key Concept 6.4) hold, modified for a 
nonlinear regression function. Under those assumptions, the slopes of the nonlinear 
regression functions can be interpreted as causal effects. The methods of this chapter 
also can be used to model nonlinear population regression functions when some of 
the regressors are control variables (the assumptions in Key Concept 6.6) and when 
these functions are used for prediction (the assumptions in Appendix 6.4). 
8.1 A General Strategy for Modeling Nonlinear Regression Functions


This section lays out a general strategy for modeling nonlinear population regression 
functions. In this strategy, the nonlinear models are extensions of the multiple regres- 
sion model and therefore can be estimated and tested using the tools of Chapters 6 
and 7. First, however, we return to the California test score data and consider the 
relationship between test scores and district income. 


Test Scores and District Income 


In Chapter 7, we found that the economic background of the students is an important 
factor in explaining performance on standardized tests. That analysis used two economic 
background variables (the percentage of students qualifying for a subsidized lunch and 
the percentage of students whose families qualify for income assistance) to measure the 
fraction of students in the district coming from poor families. A different, broader mea- 
sure of economic background is the average annual per capita income in the school 
district (“district income”). The California data set includes district income measured in 
thousands of 1998 dollars. The sample contains a wide range of income levels: For the 
420 districts in our sample, the median district income is 13.7 (that is, $13,700 per 
person), and it ranges from 5.3 ($5300 per person) to 55.3 ($55,300 per person). 
Figure 8.2 shows a scatterplot of fifth-grade test scores against district income for
the California data set, along with the OLS regression line relating these two
variables. Test scores and district income are strongly positively correlated, with a
correlation coefficient of 0.71; students from affluent districts do better on the tests
than students from poor districts. But this scatterplot has a peculiarity: Most of the
points are below the OLS line when income is very low (under $10,000) or very high
(over $40,000), but they are above the line when income is between $15,000 and
$30,000. There seems to be some curvature in the relationship between test scores
and district income that is not captured by the linear regression.

FIGURE 8.2  Scatterplot of Test Scores vs. District Income with a Linear OLS Regression Function

In short, it seems that the relationship between district income and test scores is 
not a straight line. Rather, it is nonlinear. A nonlinear function is a function with a 
slope that is not constant: The function f(X) is linear if the slope of f(X) is the same 
for all values of X, but if the slope depends on the value of X, then f(X) is nonlinear. 

If a straight line is not an adequate description of the relationship between dis- 
trict income and test scores, what is? Imagine drawing a curve that fits the points in 
Figure 8.2. This curve would be steep for low values of district income and then would 
flatten out as district income gets higher. One way to approximate such a curve math- 
ematically is to model the relationship as a quadratic function. That is, we could 
model test scores as a function of income and the square of income. 

A quadratic population regression model relating test scores and income is written
mathematically as

\( TestScore_i = \beta_0 + \beta_1 Income_i + \beta_2 Income_i^2 + u_i, \) (8.1)

where \( \beta_0 \), \( \beta_1 \), and \( \beta_2 \) are coefficients; \( Income_i \) is the income in the \( i \)th district; \( Income_i^2 \)
is the square of income in the \( i \)th district; and \( u_i \) is an error term that, as usual, represents
all the other factors that determine test scores. Equation (8.1) is called the
quadratic regression model because the population regression function,
\( E(TestScore_i \mid Income_i) = \beta_0 + \beta_1 Income_i + \beta_2 Income_i^2 \), is a quadratic function of
the independent variable, Income.

If you knew the population coefficients \( \beta_0 \), \( \beta_1 \), and \( \beta_2 \) in Equation (8.1), you could
predict the test score of a district based on its average income. But these population
coefficients are unknown and therefore must be estimated using a sample of data.

At first, it might seem difficult to find the coefficients of the quadratic function
that best fits the data in Figure 8.2. If you compare Equation (8.1) with the multiple
regression model in Key Concept 6.2, however, you will see that Equation (8.1) is, in
fact, a version of the multiple regression model with two regressors: The first regressor
is Income, and the second regressor is \( Income^2 \). Mechanically, you can create this
second regressor by generating a new variable that equals the square of Income (for
example, as an additional column in a spreadsheet). Thus, after defining the regressors
as Income and \( Income^2 \), the nonlinear model in Equation (8.1) is simply a multiple
regression model with two regressors!
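
The mechanical step described above (add a squared column, then run an ordinary multiple regression) can be sketched in plain Python. Everything in this block is illustrative: the data are simulated, not the California data set; the coefficients used to generate them are borrowed from the estimates the text reports for this model; the noise level is arbitrary; and `ols` is a hypothetical helper that solves the normal equations directly.

```python
import random


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y.

    X is a list of rows (each a list of regressor values, including the
    constant); the system is solved by Gauss-Jordan elimination with pivoting.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Simulated stand-in data: 420 "districts" with income in thousands of dollars.
rng = random.Random(0)
income = [rng.uniform(5.3, 55.3) for _ in range(420)]
score = [607.3 + 3.85 * x - 0.0423 * x ** 2 + rng.gauss(0.0, 1.0) for x in income]

# The "new column": each row holds the constant, Income, and Income squared.
X = [[1.0, x, x ** 2] for x in income]
b0, b1, b2 = ols(X, score)
print(f"intercept = {b0:.2f}, Income = {b1:.3f}, Income^2 = {b2:.4f}")
```

In practice a regression package would also report standard errors; the point of the sketch is only that nothing beyond the extra column distinguishes this regression from any other multiple regression.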

Because the quadratic regression model is a variant of multiple regression, its
unknown population coefficients can be estimated and tested using the OLS methods
described in Chapters 6 and 7. Estimating the coefficients of Equation (8.1) using
OLS for the 420 observations in Figure 8.2 yields

\( \widehat{TestScore} = \underset{(2.9)}{607.3} + \underset{(0.27)}{3.85}\,Income - \underset{(0.0048)}{0.0423}\,Income^2, \quad R^2 = 0.554, \) (8.2)

where, as usual, standard errors of the estimated coefficients are given in parentheses.
The estimated regression function of Equation (8.2) is plotted in Figure 8.3, superimposed
over the scatterplot of the data. The quadratic function captures the curvature
in the scatterplot: It is steep for low values of district income but flattens out
when district income is high. In short, the quadratic regression function seems to fit
the data better than the linear one.

FIGURE 8.3  Scatterplot of Test Scores vs. District Income with Linear and Quadratic Regression Functions

We can go one step beyond this visual comparison and formally test the hypothesis
that the relationship between district income and test scores is linear against the
alternative that it is nonlinear. If the relationship is linear, then the regression function
is correctly specified as Equation (8.1) except that the regressor \( Income^2 \) is
absent; that is, if the relationship is linear, then Equation (8.1) holds with \( \beta_2 = 0 \).
Thus we can test the null hypothesis that the population regression function is linear
against the alternative that it is quadratic by testing the null hypothesis that \( \beta_2 = 0 \)
against the alternative that \( \beta_2 \ne 0 \).

Because Equation (8.1) is just a variant of the multiple regression model, the
null hypothesis that \( \beta_2 = 0 \) can be tested by constructing the t-statistic for this
hypothesis. This t-statistic is \( t = (\hat{\beta}_2 - 0)/SE(\hat{\beta}_2) \), which from Equation (8.2) is
\( t = -0.0423/0.0048 = -8.81 \). In absolute value, this exceeds the 5% critical value
of this test (which is 1.96). Indeed, the p-value for the t-statistic is less than 0.01%, so
we can reject the hypothesis that \( \beta_2 = 0 \) at all conventional significance levels. Thus
this formal hypothesis test supports our informal inspection of Figures 8.2 and 8.3:
The quadratic model fits the data better than the linear model.
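
The arithmetic of this test can be reproduced with a short sketch. It uses only the estimate and standard error reported in Equation (8.2) and assumes the large-sample standard normal distribution for the t-statistic; the variable names are illustrative.

```python
from statistics import NormalDist

beta2_hat = -0.0423   # coefficient on Income^2 from Equation (8.2)
se_beta2 = 0.0048     # its standard error from Equation (8.2)

# t-statistic for the null hypothesis beta_2 = 0.
t_stat = (beta2_hat - 0) / se_beta2

# Two-sided p-value, computed from the lower tail to avoid rounding to zero.
p_value = 2 * NormalDist().cdf(-abs(t_stat))

print(f"t = {t_stat:.4f}")
print(f"exceeds the 5% critical value of 1.96: {abs(t_stat) > 1.96}")
print(f"p-value below 0.01%: {p_value < 0.0001}")
```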
The Effect on Y of a Change in X in Nonlinear Specifications


Put aside the test score example for a moment, and consider a general problem.
You want to know how the dependent variable Y is expected to change when the
independent variable \( X_1 \) changes by the amount \( \Delta X_1 \), holding constant other
independent variables \( X_2, \ldots, X_k \). When the population regression function is linear,
this effect is easy to calculate: As shown in Equation (6.4), the expected change
in Y is \( \Delta Y = \beta_1 \Delta X_1 \), where \( \beta_1 \) is the population regression coefficient multiplying
\( X_1 \). When the regression function is nonlinear, however, the expected change in Y
is more complicated to calculate because it can depend on the values of the
independent variables.


A general formula for a nonlinear population regression function.¹ The nonlinear
population regression models considered in this chapter are of the form

\( Y_i = f(X_{1i}, X_{2i}, \ldots, X_{ki}) + u_i, \quad i = 1, \ldots, n, \) (8.3)

where \( f(X_{1i}, X_{2i}, \ldots, X_{ki}) \) is the population nonlinear regression function, a possibly
nonlinear function of the independent variables \( X_{1i}, X_{2i}, \ldots, X_{ki} \), and \( u_i \) is the error
term. For example, in the quadratic regression model in Equation (8.1), only one
independent variable is present, so \( X_1 \) is Income and the population regression function is
\( f(Income_i) = \beta_0 + \beta_1 Income_i + \beta_2 Income_i^2 \).

Because the population regression function is the conditional expectation of \( Y_i \)
given \( X_{1i}, X_{2i}, \ldots, X_{ki} \), in Equation (8.3) we allow for the possibility that this conditional
expectation is a nonlinear function of \( X_{1i} \); that is, \( E(Y_i \mid X_{1i}, X_{2i}, \ldots, X_{ki}) =
f(X_{1i}, X_{2i}, \ldots, X_{ki}) \), where f can be a nonlinear function. If the population regression
function is linear, then \( f(X_{1i}, X_{2i}, \ldots, X_{ki}) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} \) and
Equation (8.3) becomes the linear regression model in Key Concept 6.2. However,
Equation (8.3) allows for nonlinear regression functions as well.


The effect on Y of a change in \( X_1 \). Suppose an experiment is conducted on
individuals with the same values of \( X_2, \ldots, X_k \), and participants are randomly
assigned treatment levels \( X_1 = x_1 \) or \( X_1 + \Delta X_1 = x_1 + \Delta x_1 \). Then the expected
difference in outcomes is the causal effect of the treatment, holding constant
\( X_2, \ldots, X_k \). In the nonlinear regression model of Equation (8.3), this effect on Y is
\( \Delta Y = f(X_1 + \Delta X_1, X_2, \ldots, X_k) - f(X_1, X_2, \ldots, X_k) \). In the context of prediction,
\( \Delta Y = f(X_1 + \Delta X_1, X_2, \ldots, X_k) - f(X_1, X_2, \ldots, X_k) \) is the predicted difference in
Y for two observations, both with the same values of \( X_2, \ldots, X_k \), but with different
values of \( X_1 \), specifically \( X_1 + \Delta X_1 \) and \( X_1 \).

Because the regression function f is unknown, this population causal effect is also
unknown. To estimate this effect, first estimate the regression function f. At a general
level, denote this estimated function by \( \hat{f} \); an example of such an estimated function
is the estimated quadratic regression function in Equation (8.2). The estimated effect
on Y (denoted \( \Delta\hat{Y} \)) of the change in \( X_1 \) is the difference between the predicted value
of Y when the independent variables take on the values \( X_1 + \Delta X_1, X_2, \ldots, X_k \) and
the predicted value of Y when they take on the values \( X_1, X_2, \ldots, X_k \).

The method for calculating the predicted change in Y associated with a change
in \( X_1 \) is summarized in Key Concept 8.1. The computational method in Key
Concept 8.1 always works, whether \( \Delta X_1 \) is large or small and whether the regressors
are continuous or discrete. Appendix 8.2 shows how to evaluate the slope using calculus
for the special case of a single continuous regressor when \( \Delta X_1 \) is small.

KEY CONCEPT 8.1
The Expected Change in Y from a Change in \( X_1 \)
in the Nonlinear Regression Model [Equation (8.3)]

The expected change in Y, \( \Delta Y \), associated with the change in \( X_1 \), \( \Delta X_1 \), holding
\( X_2, \ldots, X_k \) constant, is the difference between the value of the population regression
function before and after changing \( X_1 \), holding \( X_2, \ldots, X_k \) constant. That is,
the expected change in Y is the difference:

\( \Delta Y = f(X_1 + \Delta X_1, X_2, \ldots, X_k) - f(X_1, X_2, \ldots, X_k). \) (8.4)

The estimator of this unknown population difference is the difference between
the predicted values for these two cases. Let \( \hat{f}(X_1, X_2, \ldots, X_k) \) be the predicted
value of Y based on the estimator \( \hat{f} \) of the population regression function. Then
the predicted change in Y is

\( \Delta\hat{Y} = \hat{f}(X_1 + \Delta X_1, X_2, \ldots, X_k) - \hat{f}(X_1, X_2, \ldots, X_k). \) (8.5)

¹ The term nonlinear regression applies to two conceptually different families of models. In the first
family, the population regression function is a nonlinear function of the X’s but is a linear function of
the unknown parameters (the \( \beta \)’s). In the second family, the population regression function is a
nonlinear function of the unknown parameters and may or may not be a nonlinear function of the X’s. The
models in the body of this chapter are all in the first family. Appendix 8.1 takes up models from the
second family.


Application to test scores and district income. What is the predicted change in test 
scores associated with a change in district income of $1000, based on the estimated 
quadratic regression function in Equation (8.2)? Because that regression function is 
quadratic, this effect depends on the initial district income. We therefore consider two 
cases: an increase in district income from 10 to 11 (i.e., from $10,000 per capita to 
$11,000 per capita) and an increase in district income from 40 to 41 (i.e., from $40,000 
per capita to $41,000 per capita). 
To compute \( \Delta\hat{Y} \) associated with the change in income from 10 to 11, we can
apply the general formula in Equation (8.5) to the quadratic regression model. Doing
so yields

\( \Delta\hat{Y} = (\hat{\beta}_0 + \hat{\beta}_1 \times 11 + \hat{\beta}_2 \times 11^2) - (\hat{\beta}_0 + \hat{\beta}_1 \times 10 + \hat{\beta}_2 \times 10^2), \) (8.6)

where \( \hat{\beta}_0 \), \( \hat{\beta}_1 \), and \( \hat{\beta}_2 \) are the OLS estimators.

The term in the first set of parentheses in Equation (8.6) is the predicted value of
Y when Income = 11, and the term in the second set of parentheses is the predicted
value of Y when Income = 10. These predicted values are calculated using the OLS
estimates of the coefficients in Equation (8.2). Accordingly, when Income = 10, the
predicted value of test scores is \( 607.3 + 3.85 \times 10 - 0.0423 \times 10^2 = 641.57 \). When
Income = 11, the predicted value is \( 607.3 + 3.85 \times 11 - 0.0423 \times 11^2 = 644.53 \).
The difference in these two predicted values is \( \Delta\hat{Y} = 644.53 - 641.57 = 2.96 \) points;
that is, the predicted difference in test scores between a district with average income
of $11,000 and one with average income of $10,000 is 2.96 points.

In the second case, when income changes from $40,000 to $41,000, the difference
in the predicted values in Equation (8.6) is \( \Delta\hat{Y} = (607.3 + 3.85 \times 41 - 0.0423 \times
41^2) - (607.3 + 3.85 \times 40 - 0.0423 \times 40^2) = 694.04 - 693.62 = 0.42 \) points. Thus
a change of income of $1000 is associated with a larger change in predicted test
scores if the initial income is $10,000 than if it is $40,000 (the predicted changes are
2.96 points versus 0.42 points). Said differently, the slope of the estimated quadratic
regression function in Figure 8.3 is steeper at low values of income (like $10,000)
than at the higher values of income (like $40,000).
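
The two calculations above follow the general recipe of Equation (8.5): evaluate the estimated regression function at the new and the old value and take the difference. A minimal sketch, using the coefficient estimates from Equation (8.2) and hypothetical function names:

```python
def predicted_test_score(income, b0=607.3, b1=3.85, b2=-0.0423):
    """Estimated quadratic regression function of Equation (8.2).

    income is district income in thousands of dollars.
    """
    return b0 + b1 * income + b2 * income ** 2


def predicted_change(f_hat, x, dx):
    """Equation (8.5) for a single regressor: f_hat(x + dx) - f_hat(x)."""
    return f_hat(x + dx) - f_hat(x)


for start in (10, 40):
    change = predicted_change(predicted_test_score, start, 1)
    print(f"income {start} -> {start + 1}: predicted change = {change:.4f} points")
```

Because `predicted_change` only calls the fitted function, the same helper works for any estimated regression function and for any size of change, which is the sense in which the method in Key Concept 8.1 "always works."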


Standard errors of estimated effects. The estimate of the effect on Y of changing \( X_1 \)
depends on the estimate of the population regression function, \( \hat{f} \), which varies from
one sample to the next. Therefore, the estimated effect contains a sampling error.
One way to quantify the sampling uncertainty associated with the estimated effect is
to compute a confidence interval for the true population effect. To do so, we need to
compute the standard error of \( \Delta\hat{Y} \) in Equation (8.5).

It is easy to compute a standard error for \( \Delta\hat{Y} \) when the regression function is
linear. The estimated effect of a change in \( X_1 \) is \( \hat{\beta}_1 \Delta X_1 \), so the standard error of \( \Delta\hat{Y} \)
is \( SE(\Delta\hat{Y}) = SE(\hat{\beta}_1)\Delta X_1 \) and a 95% confidence interval for the change
is \( \hat{\beta}_1 \Delta X_1 \pm 1.96\,SE(\hat{\beta}_1)\Delta X_1 \).

In the nonlinear regression models of this chapter, the standard error of \( \Delta\hat{Y} \) can
be computed using the tools introduced in Section 7.3 for testing a single restriction
involving multiple coefficients. To illustrate this method, consider the estimated
change in test scores associated with a change in income from 10 to 11 in Equation (8.6),
which is \( \Delta\hat{Y} = \hat{\beta}_1 \times (11 - 10) + \hat{\beta}_2 \times (11^2 - 10^2) = \hat{\beta}_1 + 21\hat{\beta}_2 \). The standard
error of the predicted change therefore is

\( SE(\Delta\hat{Y}) = SE(\hat{\beta}_1 + 21\hat{\beta}_2). \) (8.7)
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Thus, if we can compute the standard error of Bi + 21 Ê, then we have computed the 
standard error of AY. 

Some regression software has a specialized command for computing the stan- 
dard error in Equation (8.7) directly. If not, there are two other ways to compute it; 
these correspond to the two approaches in Section 73 for testing a single restriction 
on multiple coefficients. 

The first method is to use approach 1 of Section 7.3, which is to compute the 
F-statistic testing the hypothesis that \(\beta_1 + 21\beta_2 = 0\). The standard error of \(\Delta\hat{Y}\) is 
then given by 

\[ SE(\Delta\hat{Y}) = \frac{|\Delta\hat{Y}|}{\sqrt{F}}. \tag{8.8} \]

When applied to the quadratic regression in Equation (8.2), the F-statistic testing the 
hypothesis that \(\beta_1 + 21\beta_2 = 0\) is \(F = 299.94\). Because \(\Delta\hat{Y} = 2.96\), applying 
Equation (8.8) gives \(SE(\Delta\hat{Y}) = 2.96/\sqrt{299.94} = 0.17\). Thus a 95% confidence interval 
for the change in the expected value of Y is \(2.96 \pm 1.96 \times 0.17\), or (2.63, 3.29). 

The second method is to use approach 2 of Section 7.3, which entails transforming 
the regressors so that, in the transformed regression, one of the coefficients is 
\(\beta_1 + 21\beta_2\). Doing this transformation is left as an exercise (Exercise 8.9). 
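
The first method is simple enough to check by hand or in a few lines of code. The following Python sketch plugs in the two numbers reported above (a predicted change of 2.96 and an F-statistic of 299.94). The function names `se_from_f` and `conf_int_95` are illustrative choices, not commands from any regression package.

```python
import math


def se_from_f(delta_y_hat, f_stat):
    """Equation (8.8): SE of the predicted change equals |change| / sqrt(F)."""
    return abs(delta_y_hat) / math.sqrt(f_stat)


def conf_int_95(estimate, se):
    """Large-sample 95% confidence interval: estimate +/- 1.96 * SE."""
    return estimate - 1.96 * se, estimate + 1.96 * se


# Values reported in the text for the quadratic regression in Equation (8.2).
delta_y_hat = 2.96
f_stat = 299.94

se = se_from_f(delta_y_hat, f_stat)
# The text rounds the standard error to 0.17 before forming the interval.
low, high = conf_int_95(delta_y_hat, round(se, 2))
print(f"SE = {se:.2f}; 95% CI = ({low:.2f}, {high:.2f})")
```

Run as written, this reproduces the standard error of 0.17 and the interval (2.63, 3.29) given in the text.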


A comment on interpreting coefficients in nonlinear specifications. In the multiple 
regression model of Chapters 6 and 7, the regression coefficients had a natural 
interpretation. For example, \(\beta_1\) is the expected change in Y associated with a change 
in \(X_1\), holding the other regressors constant. But as we have seen, this is not generally 
the case in a nonlinear model. That is, it is not very helpful to think of \(\beta_1\) in Equation 
(8.1) as being the effect of changing the district income, holding the square of the 
district income constant. In nonlinear models, the regression function is best 
interpreted by graphing it and by calculating the predicted effect on Y of changing one or 
more of the independent variables. 


Note on Equation (8.8): The formula is derived by noting that the F-statistic is the square 
of the t-statistic testing the hypothesis \(\beta_1 + 21\beta_2 = 0\); that is, 
\(F = t^2 = [(\hat{\beta}_1 + 21\hat{\beta}_2)/SE(\hat{\beta}_1 + 21\hat{\beta}_2)]^2 = [\Delta\hat{Y}/SE(\Delta\hat{Y})]^2\). 
Solving for \(SE(\Delta\hat{Y})\) gives Equation (8.8). 


A General Approach to Modeling Nonlinearities 
Using Multiple Regression 


The general approach to modeling nonlinear regression functions taken in this 
chapter has five elements: 


1. Identify a possible nonlinear relationship. The best thing to do is to use economic 
theory and what you know about the application to suggest a possible 
nonlinear relationship. Before you even look at the data, ask yourself whether 
the slope of the regression function relating Y and X might reasonably depend 
on the value of X or on another independent variable. Why might such nonlinear 
dependence exist? What nonlinear shapes does this suggest? For example, thinking 
about classroom dynamics with 11-year-olds suggests that cutting class size 
from 18 students to 17 could have a greater effect than cutting it from 30 to 29. 


2. Specify a nonlinear function, and estimate its parameters by OLS. Sections 
8.2 and 8.3 contain various nonlinear regression functions that can be estimated 
by OLS. After working through these sections, you will understand the charac- 
teristics of each of these functions. 


3. Determine whether the nonlinear model improves upon a linear model. Just 
because you think a regression function is nonlinear does not mean it really is! 
You must determine empirically whether your nonlinear model is appropriate. 
Most of the time you can use t-statistics and F-statistics to test the null hypothesis 
that the population regression function is linear against the alternative that 
it is nonlinear. 


4. Plot the estimated nonlinear regression function. Does the estimated regres- 
sion function describe the data well? Looking at Figures 8.2 and 8.3 suggests 
that the quadratic model fits the data better than the linear model. 

5. Estimate the effect on Y of a change in X. The final step is to use the estimated 
regression to calculate the effect on Y of a change in one or more regressors X 
using the method in Key Concept 8.1. 


8.2 Nonlinear Functions of a Single 
Independent Variable 


This section provides two methods for modeling a nonlinear regression function. To 
keep things simple, we develop these methods for a nonlinear regression function 
that involves only one independent variable, X. As we see in Section 8.5, however, 
these models can be modified to include multiple independent variables. 

The first method discussed in this section is polynomial regression, an extension of the 
quadratic regression used in the last section to model the relationship between test scores 
and district income. The second method uses logarithms of X, of Y, or of both X and Y. 
Although these methods are presented separately, they can be used in combination. 

Appendix 8.2 provides a calculus-based treatment of the models in this section. 


Polynomials 


One way to specify a nonlinear regression function is to use a polynomial in X. In 
general, let r denote the highest power of X that is included in the regression. The 
polynomial regression model of degree r is 


\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \cdots + \beta_r X_i^r + u_i. \tag{8.9} \]

When r = 2, Equation (8.9) is the quadratic regression model discussed in Section 8.1. 
When r = 3, so that the highest power of X included is \(X^3\), Equation (8.9) is called 
the cubic regression model. 

The polynomial regression model is similar to the multiple regression model of 
Chapter 6 except that in Chapter 6 the regressors were distinct independent variables, 
whereas here the regressors are powers of the same independent variable, X; that 
is, the regressors are \(X\), \(X^2\), \(X^3\), and so on. Thus the techniques for estimation and 
inference developed for multiple regression can be applied here. In particular, the 
unknown coefficients \(\beta_0, \beta_1, \ldots, \beta_r\) in Equation (8.9) can be estimated by OLS 
regression of \(Y_i\) against \(X_i, X_i^2, \ldots, X_i^r\). 


Testing the null hypothesis that the population regression function is linear. If the 
population regression function is linear, then the quadratic and higher-degree terms 
do not enter the population regression function. Accordingly, the null hypothesis 
(\(H_0\)) that the regression is linear and the alternative (\(H_1\)) that it is a polynomial of 
degree up to r correspond to 


\[ H_0\colon \beta_2 = 0,\ \beta_3 = 0,\ \ldots,\ \beta_r = 0 \quad \text{vs.} \quad H_1\colon \text{at least one } \beta_j \neq 0,\ j = 2, \ldots, r. \tag{8.10} \]


The null hypothesis that the population regression function is linear can be tested 
against the alternative that it is a polynomial of degree up to r by testing \(H_0\) against 
\(H_1\) in Equation (8.10). Because \(H_0\) is a joint null hypothesis with \(q = r - 1\) restrictions 
on the coefficients of the population polynomial regression model, it can be 
tested using the F-statistic as described in Section 7.2. 


Which degree polynomial should I use? That is, how many powers of X should be 
included in a polynomial regression? The answer balances a trade-off between flexibility 
and statistical precision. Increasing the degree r introduces more flexibility 
into the regression function and allows it to match more shapes; a polynomial of 
degree r can have up to \(r - 1\) bends (that is, turning points) in its graph. But 
increasing r means adding more regressors, which can reduce the precision of the 
estimated coefficients. 

Thus the answer to the question of how many terms to include is that you should 
include enough to model the nonlinear regression function adequately—but no 
more. Unfortunately, this answer is not very useful in practice! 

A practical way to determine the degree of the polynomial is to ask whether the 
coefficients in Equation (8.9) on the highest powers of X are 0. If so, then these 
terms can be dropped from the regression. This procedure, which is called sequential 
hypothesis testing because individual hypotheses are tested sequentially, is summarized 
in the following steps: 


1. Pick a maximum value of r, and estimate the polynomial regression for 
that r. 


2. Use the t-statistic to test the hypothesis that the coefficient on \(X^r\), \(\beta_r\) in Equation 
(8.9), is 0. If you reject this hypothesis, then \(X^r\) belongs in the regression, so use 
the polynomial of degree r. 


3. If you do not reject \(\beta_r = 0\) in step 2, eliminate \(X^r\) from the regression, and 
estimate a polynomial regression of degree \(r - 1\). Test whether the coefficient 
on \(X^{r-1}\) is 0. If you reject, use the polynomial of degree \(r - 1\). 


4. If you do not reject \(\beta_{r-1} = 0\) in step 3, continue this procedure until the 
coefficient on the highest power in your polynomial is statistically significant. 


This recipe has one missing ingredient: the initial degree r of the polynomial. In many 
applications involving economic data, the nonlinear functions are smooth; that is, 
they do not have sharp jumps, or “spikes.” If so, then it is appropriate to choose a 
small maximum degree for the polynomial, such as 2, 3, or 4; that is, to begin with 
r = 2 or 3 or 4 in step 1. 
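
The sequential procedure can be written as a short loop. The Python sketch below is only an outline under stated assumptions: `t_stat_for_degree` is a hypothetical callback that the reader would supply to estimate the degree-r regression and return the t-statistic on the highest power, and falling back to the linear model when no higher power is significant is a choice made for this sketch.

```python
def choose_degree(t_stat_for_degree, max_degree, critical_value=1.96):
    """Sequential hypothesis testing for the degree of a polynomial regression.

    t_stat_for_degree(r) is a user-supplied (hypothetical) callback that estimates
    the degree-r polynomial regression and returns the t-statistic on X**r.
    """
    for r in range(max_degree, 1, -1):
        # Steps 2-4: test the coefficient on the highest remaining power.
        if abs(t_stat_for_degree(r)) > critical_value:
            return r  # reject beta_r = 0, so keep the polynomial of degree r
    return 1  # no higher power was significant; this sketch falls back to linear


# Illustration: 1.97 is the t-statistic on the cubic term reported for Equation (8.11);
# the degree-4 and degree-2 values are made up for the example.
t_stats = {4: 0.85, 3: 1.97, 2: 8.0}
chosen = choose_degree(lambda r: t_stats[r], max_degree=4)
print(chosen)
```

With these inputs the loop drops the fourth power, keeps the cubic term, and stops at degree 3. In real use the callback would re-estimate the regression at each degree, because the t-statistic on a given power changes when higher powers are removed.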


Application to district income and test scores. The estimated cubic regression function 
relating district income to test scores is 


\[ \widehat{TestScore} = \underset{(5.1)}{600.1} + \underset{(0.71)}{5.02}\,Income - \underset{(0.029)}{0.096}\,Income^2 + \underset{(0.00035)}{0.00069}\,Income^3, \quad \bar{R}^2 = 0.555. \tag{8.11} \]


The t-statistic on \(Income^3\) is 1.97, so the null hypothesis that the regression function 
is a quadratic is rejected against the alternative that it is a cubic at the 5% level. 
Moreover, the F-statistic testing the joint null hypothesis that the coefficients on 
\(Income^2\) and \(Income^3\) are both 0 is 37.7, with a p-value less than 0.01%, so the null 
hypothesis that the regression function is linear is rejected against the alternative 
that it is either a quadratic or a cubic. 


Interpretation of coefficients in polynomial regression models. The coefficients 
in polynomial regressions do not have a simple interpretation. The best way to 
interpret polynomial regressions is to plot the estimated regression function and 
calculate the estimated effect on Y associated with a change in X for one or more 
values of X. 
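
As an illustration of that advice, the Python sketch below evaluates the estimated cubic function in Equation (8.11) at a few income levels and reports the predicted change for a $1000 increase. The function names are illustrative, and the results inherit the rounding of the published coefficients, so they should be read as approximate.

```python
def predicted_test_score(income):
    """Estimated cubic regression function, Equation (8.11); income in $ thousands."""
    return 600.1 + 5.02 * income - 0.096 * income**2 + 0.00069 * income**3


def predicted_change(income_from, income_to):
    """Predicted effect on test scores of moving income between two values."""
    return predicted_test_score(income_to) - predicted_test_score(income_from)


change_low = predicted_change(10, 11)    # a $1000 increase in a lower-income district
change_high = predicted_change(40, 41)   # the same increase in a higher-income district
print(f"10 -> 11: {change_low:.2f} points; 40 -> 41: {change_high:.2f} points")
```

The predicted change is roughly 3.2 points when income rises from 10 to 11 and roughly 0.6 points when it rises from 40 to 41, which is the kind of comparison that the individual coefficients do not reveal on their own.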


Logarithms 


Another way to specify a nonlinear regression function is to use the natural loga- 
rithm of Y and/or X. Logarithms convert changes in variables into percentage 
changes, and many relationships are naturally expressed in terms of percentages. 
Here are some examples: 

• A box in Chapter 3, “Social Class or Education? Childhood Circumstances and 
Adult Earnings Revisited,” examined the household earnings gap by socioeconomic 
classification. In that discussion, the wage gap was measured in terms of 
pounds sterling. However, it is easier to compare wage gaps across professions 
and over time when they are expressed in percentage terms. 


• In Section 8.1, we found that district income and test scores were nonlinearly 
related. Would this relationship be linear using percentage changes? That is, might 
it be that a change in district income of 1% (rather than $1000) is associated with 
a change in test scores that is approximately constant for different values of income? 


• In the economic analysis of consumer demand, it is often assumed that a 1% 
increase in price leads to a certain percentage decrease in the quantity demanded. 
The percentage decrease in demand resulting from a 1% increase in price is 
called the price elasticity. 


Regression specifications that use natural logarithms allow regression models to esti- 
mate percentage relationships such as these. Before introducing those specifications, 
we review the exponential and natural logarithm functions. 


The exponential function and the natural logarithm. The exponential function and 
its inverse, the natural logarithm, play an important role in modeling nonlinear regression 
functions. The exponential function of x is \(e^x\) (that is, e raised to the power x), 
where e is the constant 2.71828 ...; the exponential function is also written as exp(x). 
The natural logarithm is the inverse of the exponential function; that is, the natural 
logarithm is the function for which \(x = \ln(e^x)\) or, equivalently, \(x = \ln[\exp(x)]\). The 
base of the natural logarithm is e. Although there are logarithms in other bases, such 
as base 10, in this text we consider only logarithms in base e (that is, the natural 
logarithm), so when we use the term logarithm, we always mean natural logarithm. 

The logarithm function \(y = \ln(x)\) is graphed in Figure 8.4. Note that the logarithm 
function is defined only for positive values of x. The logarithm function has a 
slope that is steep at first and then flattens out (although the function continues to 
increase). The slope of the logarithm function ln(x) is 1/x. 

The logarithm function has the following useful properties: 


\[ \ln(1/x) = -\ln(x); \tag{8.12} \]
\[ \ln(ax) = \ln(a) + \ln(x); \tag{8.13} \]
\[ \ln(x/a) = \ln(x) - \ln(a); \text{ and} \tag{8.14} \]
\[ \ln(x^a) = a\ln(x). \tag{8.15} \]


FIGURE 8.4 The Logarithm Function, y = ln(x) 
The logarithmic function y = ln(x) is steeper for small than for large values of x, 
is defined only for x > 0, and has slope 1/x. 


Logarithms and percentages. The link between the logarithm and percentages 
relies on a key fact: When \(\Delta x\) is small, the difference between the logarithm of 
\(x + \Delta x\) and the logarithm of x is approximately \(\Delta x/x\), the percentage change in x 
divided by 100. That is, 


\[ \ln(x + \Delta x) - \ln(x) \cong \frac{\Delta x}{x} \quad \left(\text{when } \frac{\Delta x}{x} \text{ is small}\right), \tag{8.16} \]


where “\(\cong\)” means “approximately equal to.” The derivation of this approximation 
relies on calculus, but it is readily demonstrated by trying out some values of x and 
\(\Delta x\). For example, when x = 100 and \(\Delta x = 1\), then \(\Delta x/x = 1/100 = 0.01\) (or 1%), 
while \(\ln(x + \Delta x) - \ln(x) = \ln(101) - \ln(100) = 0.00995\) (or 0.995%). Thus 
\(\Delta x/x\) (which is 0.01) is very close to \(\ln(x + \Delta x) - \ln(x)\) (which is 0.00995). When 
\(\Delta x = 5\), \(\Delta x/x = 5/100 = 0.05\), while \(\ln(x + \Delta x) - \ln(x) = \ln(105) - \ln(100) = 
0.04879\). 
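
Trying out values is easy to automate. The short Python sketch below compares the two sides of the approximation in Equation (8.16) at x = 100; the step of 25 is an extra case added here to show how the approximation deteriorates when the change is not small.

```python
import math


def log_change(x, dx):
    """Exact difference ln(x + dx) - ln(x)."""
    return math.log(x + dx) - math.log(x)


x = 100
for dx in (1, 5, 25):
    exact = log_change(x, dx)
    approx = dx / x  # right-hand side of Equation (8.16)
    print(f"dx = {dx:>2}: dx/x = {approx:.5f}, log difference = {exact:.5f}")
```

For a change of 1 or 5 the two columns agree to about two significant digits, matching the values 0.00995 and 0.04879 in the text. For a change of 25 the log difference is noticeably smaller than 0.25.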


The three logarithmic regression models. There are three different cases in which 
logarithms might be used: when X is transformed by taking its logarithm but Y is not; 
when Y is transformed to its logarithm but X is not; and when both Y and X are 
transformed to their logarithms. The interpretation of the regression coefficients is 
different in each case. We discuss these three cases in turn. 


Case I: X is in logarithms, Y is not. In this case, the regression model is 

\[ Y_i = \beta_0 + \beta_1 \ln(X_i) + u_i, \quad i = 1, \ldots, n. \tag{8.17} \]

Because Y is not in logarithms but X is, this is sometimes referred to as a linear-log model. 

In the linear-log model, a 1% change in X is associated with a change in Y of \(0.01\beta_1\). 
To see this, consider the difference between the population regression function at 
values of X that differ by \(\Delta X\): This is \([\beta_0 + \beta_1 \ln(X + \Delta X)] - [\beta_0 + \beta_1 \ln(X)] = 
\beta_1[\ln(X + \Delta X) - \ln(X)] \cong \beta_1(\Delta X/X)\), where the final step uses the approximation 
in Equation (8.16). If X changes by 1%, then \(\Delta X/X = 0.01\); thus in this model a 
1% change in X is associated with a change of Y of \(0.01\beta_1\). 

The only difference between the regression model in Equation (8.17) and the 
regression model of Chapter 4 with a single regressor is that the right-hand variable 
is now the logarithm of X rather than X itself. To estimate the coefficients \(\beta_0\) and \(\beta_1\) 
in Equation (8.17), first compute a new variable, ln(X), which is readily done using a 
spreadsheet or statistical software. Then \(\beta_0\) and \(\beta_1\) can be estimated by the OLS 
regression of \(Y_i\) on \(\ln(X_i)\), hypotheses about \(\beta_1\) can be tested using the t-statistic, and 
a 95% confidence interval for \(\beta_1\) can be constructed as \(\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1)\). 

As an example, return to the relationship between district income and test scores. 
Instead of the quadratic specification, we could use the linear-log specification in 
Equation (8.17). Estimating this regression by OLS yields 


\[ \widehat{TestScore} = \underset{(3.8)}{557.8} + \underset{(1.40)}{36.42}\,\ln(Income), \quad \bar{R}^2 = 0.561. \tag{8.18} \]


According to Equation (8.18), a 1% increase in income is associated with an increase 
in test scores of \(0.01 \times 36.42 = 0.36\) points. 

To estimate the effect on Y of a change in X in its original units of thousands 
of dollars (not in logarithms), we can use the method in Key Concept 8.1. For 
example, what is the predicted difference in test scores for districts with average 
incomes of $10,000 versus $11,000? The estimated value of \(\Delta Y\) is the difference 
between the predicted values: \(\Delta\hat{Y} = [557.8 + 36.42\ln(11)] - [557.8 + 36.42\ln(10)] 
= 36.42 \times [\ln(11) - \ln(10)] = 3.47\). Similarly, the predicted difference 
between a district with average income of $40,000 and a district with average 
income of $41,000 is \(36.42 \times [\ln(41) - \ln(40)] = 0.90\). Thus, like the quadratic 
specification, this regression predicts that a $1000 increase in income has a larger 
effect on test scores in poor districts than it does in affluent districts. 

The estimated linear-log regression function in Equation (8.18) is plotted in Fig- 
ure 8.5. Because the regressor in Equation (8.18) is the natural logarithm of income 
rather than income, the estimated regression function is not a straight line. Like the 
quadratic regression function in Figure 8.3, it is initially steep but then flattens out 
for higher levels of income. 
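
The linear-log calculations above can be reproduced directly from the estimates in Equation (8.18). In the Python sketch below, the constant and function names are illustrative; the inputs are the published, rounded coefficients.

```python
import math

B0_HAT, B1_HAT = 557.8, 36.42  # estimates from Equation (8.18)


def predicted_score(income):
    """Predicted test score from the linear-log regression; income in $ thousands."""
    return B0_HAT + B1_HAT * math.log(income)


diff_low = predicted_score(11) - predicted_score(10)    # $10,000 versus $11,000
diff_high = predicted_score(41) - predicted_score(40)   # $40,000 versus $41,000
one_percent = 0.01 * B1_HAT  # approximate effect of a 1% increase in income

print(f"10 -> 11: {diff_low:.2f}; 40 -> 41: {diff_high:.2f}; 1% rule: {one_percent:.2f}")
```

The output matches the values 3.47, 0.90, and 0.36 reported in the text.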


FIGURE 8.5 The Linear-Log Regression Function 
The estimated linear-log regression function \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \ln(X)\) captures much of the 
nonlinear relation between test scores and district income. (Vertical axis: test score.) 


Case II: Y is in logarithms, X is not. In this case, the regression model is 

\[ \ln(Y_i) = \beta_0 + \beta_1 X_i + u_i. \tag{8.19} \]

Because Y is in logarithms but X is not, this is referred to as a log-linear model. 

In the log-linear model, a one-unit change in X (\(\Delta X = 1\)) is associated with a 
\((100 \times \beta_1)\)% change in Y. To see this, compare the expected values of ln(Y) for 
values of X that differ by \(\Delta X\). The expected value of ln(Y) given X is \(\ln(Y) = \beta_0 + \beta_1 X\). 
For \(X + \Delta X\), the expected value is given by \(\ln(Y + \Delta Y) = \beta_0 + \beta_1(X + \Delta X)\). 
Thus the difference between these expected values is \(\ln(Y + \Delta Y) - \ln(Y) = 
[\beta_0 + \beta_1(X + \Delta X)] - [\beta_0 + \beta_1 X] = \beta_1 \Delta X\). From the approximation in Equation 
(8.16), however, if \(\beta_1 \Delta X\) is small, then \(\ln(Y + \Delta Y) - \ln(Y) \cong \Delta Y/Y\). Thus 
\(\Delta Y/Y \cong \beta_1 \Delta X\). If \(\Delta X = 1\), so that X changes by one unit, then \(\Delta Y/Y\) changes by 
\(\beta_1\). Translated into percentages, a one-unit change in X is associated with a 
\((100 \times \beta_1)\)% change in Y. 

As an illustration, we return to the empirical example of Section 3.7, the relationship 
between age and earnings of college graduates. Some employment contracts 
specify that, for each additional year of service, a worker gets a certain 
percentage increase in his or her wage. This percentage relationship suggests estimating 
the log-linear specification in Equation (8.19) so that each additional year 
of age (X) is, on average, associated with some constant percentage increase in 
earnings (Y). By first computing the new dependent variable, \(\ln(Earnings_i)\), the 
unknown coefficients \(\beta_0\) and \(\beta_1\) can be estimated by the OLS regression of 
\(\ln(Earnings_i)\) against \(Age_i\). When estimated using the 13,872 observations on college 
graduates in the March 2016 Current Population Survey (the data are described 
in Appendix 3.1), this relationship is 


\[ \widehat{\ln(Earnings)} = \underset{(0.019)}{2.876} + \underset{(0.0004)}{0.0095}\,Age, \quad \bar{R}^2 = 0.033. \tag{8.20} \]


According to this regression, earnings are predicted to increase by 0.95% 
[\((100 \times 0.0095)\)%] for each additional year of age. 
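
The 0.95% figure relies on the approximation in Equation (8.16), which works well because the coefficient is small. As a rough check, the Python sketch below compares the approximate percentage change with the change implied by exponentiating the fitted log difference, \(100 \times [\exp(\beta_1 \Delta X) - 1]\). That second formula is a standard identity added here for comparison and is not part of the text's derivation; the function names are illustrative.

```python
import math

B1_HAT = 0.0095  # coefficient on Age in Equation (8.20)


def approx_pct_change(b1, dx=1.0):
    """Approximate percentage change in Y: 100 * b1 * dx (uses Equation (8.16))."""
    return 100 * b1 * dx


def exact_pct_change(b1, dx=1.0):
    """Percentage change implied by the fitted log difference: 100 * (exp(b1 * dx) - 1)."""
    return 100 * (math.exp(b1 * dx) - 1)


for years in (1, 10):
    approx = approx_pct_change(B1_HAT, years)
    exact = exact_pct_change(B1_HAT, years)
    print(f"{years:>2} year(s): approximate {approx:.3f}%, from log difference {exact:.3f}%")
```

For one year the two numbers are nearly identical (about 0.95%). Over ten years the gap widens (9.5% versus roughly 10.0%), a reminder that the percentage interpretation is most accurate for small changes.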
Case III: Both X and Y are in logarithms. In this case, the regression model is 

\[ \ln(Y_i) = \beta_0 + \beta_1 \ln(X_i) + u_i. \tag{8.21} \]

Because both Y and X are specified in logarithms, this is referred to as a log-log 
model. 

In the log-log model, a 1% change in X is associated with a \(\beta_1\)% change in Y. 
Thus in this specification \(\beta_1\) is the elasticity of Y with respect to X. To see this, again 
apply Key Concept 8.1; thus \(\ln(Y + \Delta Y) - \ln(Y) = [\beta_0 + \beta_1 \ln(X + \Delta X)] - 
[\beta_0 + \beta_1 \ln(X)] = \beta_1[\ln(X + \Delta X) - \ln(X)]\). Application of the approximation in 
Equation (8.16) to both sides of this equation yields 


\[ \frac{\Delta Y}{Y} \cong \beta_1 \frac{\Delta X}{X}, \quad \text{or} \quad \beta_1 = \frac{\Delta Y/Y}{\Delta X/X} = \frac{100 \times (\Delta Y/Y)}{100 \times (\Delta X/X)} = \frac{\text{percentage change in } Y}{\text{percentage change in } X}. \tag{8.22} \]


Thus in the log-log specification \(\beta_1\) is the ratio of the percentage change in Y 
to the associated percentage change in X. If the percentage change in X is 1% (that is, if 
\(\Delta X = 0.01X\)), then \(\beta_1\) is the percentage change in Y associated with a 1% change in X. 
That is, \(\beta_1\) is the elasticity of Y with respect to X. 

As an illustration, return to the relationship between district income and test 
scores. When this relationship is specified in this form, the unknown coefficients are 
estimated by a regression of the logarithm of test scores against the logarithm of 
district income. The resulting estimated equation is 


\[ \widehat{\ln(TestScore)} = \underset{(0.006)}{6.336} + \underset{(0.0021)}{0.0554}\,\ln(Income), \quad \bar{R}^2 = 0.557. \tag{8.23} \]

According to this estimated regression function, a 1% increase in income is estimated 
to correspond to a 0.0554% increase in test scores. 
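
The elasticity interpretation is again an approximation that is best for small percentage changes. The Python sketch below uses the estimate from Equation (8.23) and compares the rule of thumb (elasticity times the percentage change in X) with the change implied by the fitted log-log function. The function name `pct_change_in_y` and the larger percentage changes are illustrative additions.

```python
import math

ELASTICITY = 0.0554  # coefficient on ln(Income) in Equation (8.23)


def pct_change_in_y(elasticity, pct_change_in_x):
    """Percentage change in Y implied by a log-log model for a given % change in X."""
    log_change_x = math.log(1 + pct_change_in_x / 100)
    return 100 * (math.exp(elasticity * log_change_x) - 1)


for pct in (1, 10, 50):
    rule_of_thumb = ELASTICITY * pct
    implied = pct_change_in_y(ELASTICITY, pct)
    print(f"{pct:>2}% change in income: rule of thumb {rule_of_thumb:.4f}%, implied {implied:.4f}%")
```

For a 1% increase in income the two numbers agree to about three decimal places. For a 50% increase, the rule of thumb overstates the implied change.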

The estimated log-log regression function in Equation (8.23) is plotted in 
Figure 8.6. Because Y is in logarithms, the vertical axis in Figure 8.6 is the logarithm 
of the test score, and the scatterplot is the logarithm of test scores versus district 
income. For comparison purposes, Figure 8.6 also shows the estimated regression 
function for a log-linear specification, which is 


\[ \widehat{\ln(TestScore)} = \underset{(0.003)}{6.439} + \underset{(0.00018)}{0.00284}\,Income, \quad \bar{R}^2 = 0.497. \tag{8.24} \]

Because the vertical axis is in logarithms, the regression function in Equation (8.24) 
is the straight line in Figure 8.6. 


FIGURE 8.6 The Log-Linear and Log-Log Regression Functions 
In the log-linear regression function, ln(Y) is a linear function of X. In the log-log regression 
function, ln(Y) is a linear function of ln(X). (Vertical axis: ln(Test score); horizontal axis: 
district income in thousands of dollars.) 


As you can see in Figure 8.6, the log-log specification fits better than the log-linear 
specification. This is consistent with the higher \(\bar{R}^2\) for the log-log regression 
(0.557) than for the log-linear regression (0.497). Even so, the log-log specification 
does not fit the data especially well: At the lower values of income, most of the observations 
fall below the log-log curve, while in the middle income range most of the 
observations fall above the estimated regression function. 

The three logarithmic regression models are summarized in Key Concept 8.2. 


A difficulty with comparing logarithmic specifications. Which of the log regression 
models best fits the data? As we saw in the discussion of Equations (8.23) and (8.24), 
the \(\bar{R}^2\) can be used to compare the log-linear and log-log models; as it happened, the 
log-log model had the higher \(\bar{R}^2\). Similarly, the \(\bar{R}^2\) can be used to compare the linear-log 
regression in Equation (8.18) and the linear regression of Y against X. In the test 
score and district income regression, the linear-log regression has an \(\bar{R}^2\) of 0.561, while 
the linear regression has an \(\bar{R}^2\) of 0.508, so the linear-log model fits the data better. 

How can we compare the linear-log model and the log-log model? Unfortunately, 
the \(\bar{R}^2\) cannot be used to compare these two regressions because their dependent 
variables are different [one is Y, the other is ln(Y)]. Recall that the \(\bar{R}^2\) measures 
the fraction of the variance of the dependent variable explained by the regressors. 
Because the dependent variables in the log-log and linear-log models are different, 
it does not make sense to compare their \(\bar{R}^2\)’s. 

Because of this problem, the best thing to do in a particular application is to 
decide, using economic theory and either your or other experts’ knowledge of the 
problem, whether it makes sense to specify Y in logarithms. For example, labor economists 
typically model earnings using logarithms because wage comparisons, contract 
wage increases, and so forth are often most naturally discussed in percentage terms. 
In modeling test scores, it seems natural (to us, anyway) to discuss test results in terms 
of points on the test rather than percentage increases in the test scores, so we focus on 
models in which the dependent variable is the test score rather than its logarithm. 


Key Concept 8.2: Logarithms in Regression: Three Cases 

Logarithms can be used to transform the dependent variable Y, an independent 
variable X, or both (but the variable being transformed must be positive). 
The following table summarizes these three cases and the interpretation of the 
regression coefficient \(\beta_1\). In each case, \(\beta_1\) can be estimated by applying OLS after 
taking the logarithm of the dependent and/or independent variable. 

Case   Regression Specification                          Interpretation of \(\beta_1\) 
I      \(Y_i = \beta_0 + \beta_1 \ln(X_i) + u_i\)        A 1% change in X is associated with a change in Y of \(0.01\beta_1\). 
II     \(\ln(Y_i) = \beta_0 + \beta_1 X_i + u_i\)        A change in X by one unit (\(\Delta X = 1\)) is associated with a \(100\beta_1\)% change in Y. 
III    \(\ln(Y_i) = \beta_0 + \beta_1 \ln(X_i) + u_i\)   A 1% change in X is associated with a \(\beta_1\)% change in Y, so \(\beta_1\) is the elasticity of Y with respect to X. 


Computing predicted values of Y when Y is in logarithms. (This material is more 
advanced and can be skipped without loss of continuity.) If the dependent variable 
Y has been transformed by taking logarithms, the estimated regression can be 
used to compute directly the predicted value of ln(Y). However, it is a bit trickier to 
compute the predicted value of Y itself. 

To see this, consider the log-linear regression model in Equation (8.19), and 
rewrite it so that it is specified in terms of Y rather than ln(Y). To do so, take the 
exponential function of both sides of Equation (8.19); the result is 


\[ Y_i = \exp(\beta_0 + \beta_1 X_i + u_i) = e^{\beta_0 + \beta_1 X_i} e^{u_i}. \tag{8.25} \]


The expected value of \(Y_i\) given \(X_i\) is \(E(Y_i | X_i) = E(e^{\beta_0 + \beta_1 X_i} e^{u_i} | X_i) = 
e^{\beta_0 + \beta_1 X_i} E(e^{u_i} | X_i)\). The problem is that even if \(E(u_i | X_i) = 0\), \(E(e^{u_i} | X_i) \neq 1\). Thus the 
appropriate predicted value of \(Y_i\) is not simply obtained by taking the exponential 
function of \(\hat{\beta}_0 + \hat{\beta}_1 X_i\), that is, by setting \(\hat{Y}_i = e^{\hat{\beta}_0 + \hat{\beta}_1 X_i}\). This predicted value is biased 
because of the missing factor \(E(e^{u_i} | X_i)\). 

One solution to this problem is to estimate the factor \(E(e^{u_i} | X_i)\) and use this 
estimate when computing the predicted value of Y. Exercise 17.12 works through 
several ways to estimate \(E(e^{u_i} | X_i)\), but this gets complicated, particularly if \(u_i\) is 
heteroskedastic, and we do not pursue it further. 

Another solution, which is the approach used in this text, is to compute predicted 
values of the logarithm of Y but not transform them to their original units. In practice, 
this is often acceptable because when the dependent variable is specified as a 
logarithm, it is often most natural just to use the logarithmic specification (and the 
associated percentage interpretations) throughout the analysis. 
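
A small simulation makes the missing factor concrete. The Python sketch below draws errors with mean zero and shows that the average of exp(u) is nevertheless above 1. The normal distribution, the standard deviation of 0.5, and the number of draws are all assumptions made for this illustration, not features of the regressions in the text.

```python
import math
import random

random.seed(0)  # fixed seed so the simulation is repeatable

SIGMA = 0.5        # assumed standard deviation of u (illustrative)
N_DRAWS = 200_000  # number of simulated errors

# Draw u with mean zero and average exp(u) over the draws.
draws = [random.gauss(0.0, SIGMA) for _ in range(N_DRAWS)]
mean_u = sum(draws) / N_DRAWS
mean_exp_u = sum(math.exp(u) for u in draws) / N_DRAWS

print(f"average of u      = {mean_u:.3f}")
print(f"average of exp(u) = {mean_exp_u:.3f}")
print(f"normal-case value exp(sigma^2 / 2) = {math.exp(SIGMA**2 / 2):.3f}")
```

Under these assumptions the average of u is close to 0 while the average of exp(u) is close to exp(0.125), about 1.13, so simply exponentiating the predicted logarithm would understate the expected value of Y by roughly 13%.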


Polynomial and Logarithmic Models 
of Test Scores and District Income 


In practice, economic theory or expert judgment might suggest a functional form to 
use, but in the end, the true form of the population regression function is unknown. 
In practice, fitting a nonlinear function therefore entails deciding which method or 
combination of methods works best. As an illustration, we compare polynomial and 
logarithmic models of the relationship between district income and test scores. 


Polynomial specifications. We considered two polynomial specifications, quadratic 
[Equation (8.2)] and cubic [Equation (8.11)]. Because the coefficient on \(Income^3\) in 
Equation (8.11) was significant at the 5% level, the cubic specification provided an 
improvement over the quadratic, so we select the cubic model as the preferred polynomial 
specification. 


Logarithmic specifications. The logarithmic specification in Equation (8.18) seemed 
to provide a good fit to these data, but we did not test this formally. One way to do 
so is to augment it with higher powers of the logarithm of income. If these additional 
terms are not statistically different from 0, then we can conclude that the specifica- 
tion in Equation (8.18) is adequate in the sense that it cannot be rejected against a 
polynomial function of the logarithm. Accordingly, the estimated cubic regression 
(specified in powers of the logarithm of income) is 


\[ \widehat{TestScore} = \underset{(79.4)}{486.1} + \underset{(87.9)}{113.4}\,\ln(Income) - \underset{(31.7)}{26.9}\,[\ln(Income)]^2 + \underset{(3.74)}{3.06}\,[\ln(Income)]^3, \quad \bar{R}^2 = 0.560. \tag{8.26} \]

The t-statistic on the coefficient on the cubic term is 0.818, so the null hypothesis that 
the true coefficient is 0 is not rejected at the 10% level. The F-statistic testing the 
joint hypothesis that the true coefficients on the quadratic and cubic term are both 0 
is 0.44, with a p-value of 0.64, so this joint null hypothesis is not rejected at the 10% 
level. Thus the cubic logarithmic model in Equation (8.26) does not provide a statisti- 
cally significant improvement over the model in Equation (8.18), which is linear in 
the logarithm of income. 


FIGURE 8.7 The Linear-Log and Cubic Regression Functions 
The estimated cubic regression function [Equation (8.11)] and the estimated linear-log 
regression function [Equation (8.18)] are nearly identical in this sample. (Vertical axis: test score.) 


Comparing the cubic and linear-log specifications. Figure 8.7 plots the estimated 
regression functions from the cubic specification in Equation (8.11) and the linear-log 
specification in Equation (8.18). The two estimated regression functions are quite 
similar. One statistical tool for comparing these specifications is the \(\bar{R}^2\). The \(\bar{R}^2\) of the 
logarithmic regression is 0.561, and for the cubic regression, it is 0.555. Because the 
logarithmic specification has a slight edge in terms of the \(\bar{R}^2\) and because this specification 
does not need higher-degree polynomials in the logarithm of income to fit 
these data, we adopt the logarithmic specification in Equation (8.18). 


Interactions Between Independent Variables 


In the introduction to this chapter, we wondered whether reducing the student-teacher 
ratio might have a bigger effect on test scores in districts where many students are still 
learning English than in those with few still learning English. This could arise, for exam- 
ple, if students who are still learning English benefit differentially from one-on-one or 
small-group instruction. If so, the presence of many English learners in a district would 
interact with the student-teacher ratio in such a way that the effect on test scores of a 
change in the student-teacher ratio would depend on the fraction of English learners. 

This section explains how to incorporate such interactions between two indepen- 
dent variables into the multiple regression model. The possible interaction between 
the student-teacher ratio and the fraction of English learners is an example of the 
more general situation in which the effect on Y of a change in one independent vari- 
able depends on the value of another independent variable. We consider three cases: 
when both independent variables are binary, when one is binary and the other is 
continuous, and when both are continuous. 


Interactions Between Two Binary Variables 


Consider the population regression of log earnings [\(Y_i\), where \(Y_i = \ln(Earnings_i)\)] 
against two binary variables: whether a worker has a college degree (\(D_{1i}\), where 
\(D_{1i} = 1\) if the i-th person graduated from college) and the worker’s sex (\(D_{2i}\), where 
\(D_{2i} = 1\) if the i-th person is female). The population linear regression of \(Y_i\) on these 
two binary variables is 


\[ Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + u_i. \tag{8.27} \]


In this regression model, \(\beta_1\) is the effect on log earnings of having a college degree, 
holding sex constant, and \(\beta_2\) is the mean difference between female and male earnings, 
holding schooling constant. 

The specification in Equation (8.27) has an important limitation: The effect of 
having a college degree in this specification, holding constant sex, is the same for 
men and women. There is, however, no reason that this must be so. Phrased mathematically, 
the effect on \(Y_i\) of \(D_{1i}\), holding \(D_{2i}\) constant, could depend on the value 
of \(D_{2i}\). In other words, there could be an interaction between having a college 
degree and sex, so that the value in the job market of a degree is different for men 
and women. 

Although the specification in Equation (8.27) does not allow for this interaction
between having a college degree and sex, it is easy to modify the specification so that
it does by introducing another regressor, the product of the two binary variables,
$D_{1i} \times D_{2i}$. The resulting regression is


$$Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 (D_{1i} \times D_{2i}) + u_i. \tag{8.28}$$


The new regressor, the product $D_{1i} \times D_{2i}$, is called an interaction term or an
interacted regressor, and the population regression model in Equation (8.28) is called
a binary variable interaction regression model.

The interaction term in Equation (8.28) allows the population effect on log earnings
($Y_i$) of having a college degree (changing $D_{1i}$ from $D_{1i} = 0$ to $D_{1i} = 1$) to
depend on sex ($D_{2i}$). To show this mathematically, calculate the population effect of
a change in $D_{1i}$ using the general method laid out in Key Concept 8.1. The first step
is to compute the conditional expectation of $Y_i$ for $D_{1i} = 0$ given a value of $D_{2i}$; this is

$$E(Y_i \mid D_{1i} = 0, D_{2i} = d_2) = \beta_0 + \beta_1 \times 0 + \beta_2 \times d_2 + \beta_3 \times (0 \times d_2) = \beta_0 + \beta_2 d_2,$$

where we use the conditional mean zero assumption, $E(u_i \mid D_{1i}, D_{2i}) = 0$. The
next step is to compute the conditional expectation of $Y_i$ after the change (that is,
for $D_{1i} = 1$) given the same value of $D_{2i}$; this is

$$E(Y_i \mid D_{1i} = 1, D_{2i} = d_2) = \beta_0 + \beta_1 \times 1 + \beta_2 \times d_2 + \beta_3 \times (1 \times d_2) = \beta_0 + \beta_1 + \beta_2 d_2 + \beta_3 d_2.$$

The effect of this change is the difference of expected values [that is, the difference
in Equation (8.4)], which is

$$E(Y_i \mid D_{1i} = 1, D_{2i} = d_2) - E(Y_i \mid D_{1i} = 0, D_{2i} = d_2) = \beta_1 + \beta_3 d_2. \tag{8.29}$$

KEY CONCEPT 8.3
A Method for Interpreting Coefficients in Regressions with Binary Variables


First, compute the expected values of Y for each possible case described by the 
set of binary variables. Next compare these expected values. Each coefficient can 
then be expressed either as an expected value or as the difference between two 
or more expected values. 


Thus, in the binary variable interaction specification in Equation (8.28), the effect of
acquiring a college degree (a unit change in $D_{1i}$) depends on the person’s sex [the
value of $D_{2i}$, which is $d_2$ in Equation (8.29)]. If the person is male ($d_2 = 0$), the effect
of acquiring a college degree is $\beta_1$, but if the person is female ($d_2 = 1$), the effect is
$\beta_1 + \beta_3$. The coefficient $\beta_3$ on the interaction term is the difference in the effect of
acquiring a college degree for women versus that for men.

Although this example was phrased using log earnings, having a college degree, 
and sex, the point is a general one. The binary variable interaction regression allows 
the effect of changing one of the binary independent variables to depend on the 
value of the other binary variable. 

The method we used here to interpret the coefficients was, in effect, to work 
through each possible combination of the binary variables. This method, which 
applies to all regressions with binary variables, is summarized in Key Concept 8.3. 


Application to the student-teacher ratio and the percentage of English learners. Let
$\mathit{HiSTR}_i$ be a binary variable that equals 1 if the student-teacher ratio is 20 or more
and that equals 0 otherwise, and let $\mathit{HiEL}_i$ be a binary variable that equals 1 if the
percentage of English learners is 10% or more and that equals 0 otherwise. The
interacted regression of test scores against $\mathit{HiSTR}_i$ and $\mathit{HiEL}_i$ is


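$$\widehat{\mathit{TestScore}} = \underset{(1.4)}{664.1} - \underset{(1.9)}{1.9}\,\mathit{HiSTR} - \underset{(2.3)}{18.2}\,\mathit{HiEL} - \underset{(3.1)}{3.5}\,(\mathit{HiSTR} \times \mathit{HiEL}), \quad \bar{R}^2 = 0.290. \tag{8.30}$$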


The predicted effect of moving from a district with a low student-teacher ratio to one
with a high student-teacher ratio, holding constant whether the percentage of English
learners is high or low, is given by Equation (8.29), with estimated coefficients
replacing the population coefficients. According to the estimates in Equation (8.30),
this effect thus is $-1.9 - 3.5\,\mathit{HiEL}$. That is, if the fraction of English learners is low
($\mathit{HiEL} = 0$), then the effect on test scores of moving from $\mathit{HiSTR} = 0$ to $\mathit{HiSTR} = 1$
is for test scores to decline by 1.9 points. If the fraction of English learners is high,
then test scores are estimated to decline by $1.9 + 3.5 = 5.4$ points.




The estimated regression in Equation (8.30) also can be used to estimate the
mean test scores for each of the four possible combinations of the binary variables.
This is done using the procedure in Key Concept 8.3. Accordingly, the sample average
test score for districts with $\mathit{HiSTR}_i = 0$ (low student-teacher ratios) and $\mathit{HiEL}_i = 0$
(low fractions of English learners) is 664.1. For districts with $\mathit{HiSTR}_i = 1$ (high
student-teacher ratios) and $\mathit{HiEL}_i = 0$ (low fractions of English learners), the
sample average is 662.2 ($= 664.1 - 1.9$). When $\mathit{HiSTR}_i = 0$ and $\mathit{HiEL}_i = 1$, the
sample average is 645.9 ($= 664.1 - 18.2$), and when $\mathit{HiSTR}_i = 1$ and $\mathit{HiEL}_i = 1$,
the sample average is 640.5 ($= 664.1 - 1.9 - 18.2 - 3.5$).
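
The four group means and the two class-size effects can be reproduced mechanically from the coefficients in Equation (8.30). The following is a minimal sketch in Python; the function names `predicted_score` and `str_effect` are illustrative choices, not part of the text, and only the four coefficient values come from Equation (8.30).

```python
# Coefficients as reported in Equation (8.30).
b0, b_histr, b_hiel, b_inter = 664.1, -1.9, -18.2, -3.5


def predicted_score(histr: int, hiel: int) -> float:
    """Predicted test score for one combination of the two binary regressors."""
    return b0 + b_histr * histr + b_hiel * hiel + b_inter * histr * hiel


def str_effect(hiel: int) -> float:
    """Predicted change in test score from HiSTR = 0 to HiSTR = 1, holding HiEL fixed."""
    return predicted_score(1, hiel) - predicted_score(0, hiel)


# Work through every combination of the binary variables (Key Concept 8.3).
for histr in (0, 1):
    for hiel in (0, 1):
        print(f"HiSTR={histr}, HiEL={hiel}: {predicted_score(histr, hiel):.1f}")

print(f"Effect of HiSTR when HiEL=0: {str_effect(0):.1f}")
print(f"Effect of HiSTR when HiEL=1: {str_effect(1):.1f}")
```

The difference between the two effects, `str_effect(1) - str_effect(0)`, equals the interaction coefficient, which is the interpretation given to $\beta_3$ in Equation (8.29).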


Interactions Between a Continuous and a Binary Variable


Next consider the population regression of log earnings [$Y_i = \ln(\mathit{Earnings}_i)$] against
one continuous variable, the individual’s years of work experience ($X_i$), and one
binary variable, whether the worker has a college degree ($D_i$, where $D_i = 1$ if the $i$th
person is a college graduate). As shown in Figure 8.8, the population regression line
relating $Y$ and the continuous variable $X$ can depend on the binary variable $D$ in
three different ways.

In Figure 8.8(a), the two regression lines differ only in their intercept. The cor- 
responding population regression model is 


$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + u_i. \tag{8.31}$$


This is the familiar multiple regression model with a population regression function
that is linear in $X_i$ and $D_i$. When $D_i = 0$, the population regression function is
$\beta_0 + \beta_1 X_i$, so the intercept is $\beta_0$ and the slope is $\beta_1$. When $D_i = 1$, the population
regression function is $\beta_0 + \beta_1 X_i + \beta_2$, so the slope remains $\beta_1$ but the intercept is
$\beta_0 + \beta_2$. Thus $\beta_2$ is the difference between the intercepts of the two regression lines,
as shown in Figure 8.8(a). Stated in terms of the earnings example, $\beta_1$ is the effect on
log earnings of an additional year of work experience, holding college degree status
constant, and $\beta_2$ is the effect of a college degree on log earnings, holding years of
experience constant. In this specification, the effect of an additional year of work
experience is the same for college graduates and nongraduates; that is, the two lines
in Figure 8.8(a) have the same slope.

In Figure 8.8(b), the two lines have different slopes and intercepts. The different 
slopes permit the effect of an additional year of work to differ for college graduates and 
nongraduates. To allow for different slopes, add an interaction term to Equation (8.31): 


$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i) + u_i, \tag{8.32}$$


where $X_i \times D_i$ is a new variable, the product of $X_i$ and $D_i$. To interpret the coefficients
of this regression, apply the procedure in Key Concept 8.3. Doing so shows that if
$D_i = 0$, the population regression function is $\beta_0 + \beta_1 X_i$, whereas if $D_i = 1$, the
population regression function is $(\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_i$. Thus this specification allows
for two different population regression functions relating $Y_i$ and $X_i$, depending on the
value of $D_i$, as is shown in Figure 8.8(b). The difference between the two intercepts is
$\beta_2$, and the difference between the two slopes is $\beta_3$. In the earnings example, $\beta_1$ is the
effect of an additional year of work experience for nongraduates ($D_i = 0$), and
$\beta_1 + \beta_3$ is this effect for graduates, so $\beta_3$ is the difference in the effect of an additional
year of work experience for college graduates versus that for nongraduates.

FIGURE 8.8 Regression Functions Using Binary and Continuous Variables
[Figure: three panels plotting Y against X. (a) Different intercepts, same slope: the upper
line is $(\beta_0 + \beta_2) + \beta_1 X$. (b) Different intercepts, different slopes: the upper line is
$(\beta_0 + \beta_2) + (\beta_1 + \beta_3) X$. (c) Same intercept, different slopes: the steeper line is
$\beta_0 + (\beta_1 + \beta_2) X$.]
Interactions of binary variables and continuous variables can produce three different
population regression functions: (a) $\beta_0 + \beta_1 X + \beta_2 D$ allows for different intercepts but
has the same slope, (b) $\beta_0 + \beta_1 X + \beta_2 D + \beta_3 (X \times D)$ allows for different intercepts and
different slopes, and (c) $\beta_0 + \beta_1 X + \beta_2 (X \times D)$ has the same intercept but allows for
different slopes.

A third possibility, shown in Figure 8.8(c), is that the two lines have different 
slopes but the same intercept. The interacted regression model for this case is 


$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 (X_i \times D_i) + u_i. \tag{8.33}$$


The coefficients of this specification also can be interpreted using Key Concept 8.3. In
terms of the earnings example, this specification allows for different effects of
experience on log earnings between college graduates and nongraduates, but it
requires that expected log earnings be the same for both groups when they have no
prior experience. Said differently, this specification corresponds to the population
mean entry-level wage being the same for college graduates and nongraduates. This
does not make much sense in this application, and in practice, this specification is used
less frequently than Equation (8.32), which allows for different intercepts and slopes.

KEY CONCEPT 8.4
Interactions Between Binary and Continuous Variables

Through the use of the interaction term $X_i \times D_i$, the population regression line
relating $Y_i$ and the continuous variable $X_i$ can have a slope that depends on the
binary variable $D_i$. There are three possibilities:

1. Different intercepts, same slope (Figure 8.8a):
   $$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + u_i;$$
2. Different intercepts and slopes (Figure 8.8b):
   $$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i) + u_i;$$
3. Same intercept, different slopes (Figure 8.8c):
   $$Y_i = \beta_0 + \beta_1 X_i + \beta_2 (X_i \times D_i) + u_i.$$

All three specifications, Equations (8.31), (8.32), and (8.33), are versions of
the multiple regression model of Chapter 6, and once the new variable $X_i \times D_i$ is
created, the coefficients of all three can be estimated by OLS.

The three regression models with a binary and a continuous independent vari- 
able are summarized in Key Concept 8.4. 
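
To make the estimation step concrete, the sketch below simulates data from the different intercept/different slope model in Equation (8.32), creates the interaction variable, and estimates the coefficients by OLS with NumPy. Everything about the data is an assumption made for illustration: the population coefficients in `beta`, the sample size, and the distributions of `x`, `d`, and `u` are hypothetical and do not come from the earnings example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical population coefficients (beta_0, beta_1, beta_2, beta_3) for
# Equation (8.32); chosen for illustration, not taken from the text.
beta = np.array([2.0, 0.03, 0.40, 0.01])

x = rng.uniform(0, 30, size=n)     # continuous regressor, e.g. years of experience
d = rng.integers(0, 2, size=n)     # binary regressor, e.g. 1 = college graduate
u = rng.normal(0, 0.5, size=n)     # error term, drawn independently of x and d

# Create the interaction variable X * D and stack it with the other regressors.
design = np.column_stack([np.ones(n), x, d, x * d])
y = design @ beta + u

# OLS estimates of (beta_0, beta_1, beta_2, beta_3).
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

# Implied (intercept, slope) of the estimated regression line for each group.
line_d0 = (beta_hat[0], beta_hat[1])
line_d1 = (beta_hat[0] + beta_hat[2], beta_hat[1] + beta_hat[3])

print("D = 0: intercept %.3f, slope %.4f" % line_d0)
print("D = 1: intercept %.3f, slope %.4f" % line_d1)
```

Because this specification lets both the intercept and the slope differ with $D_i$, the fitted line for each group coincides with the line obtained by running OLS on that group alone; the single interacted regression is nonetheless convenient because it delivers the differences $\hat{\beta}_2$ and $\hat{\beta}_3$ directly.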


Application to the student-teacher ratio and the percentage of English 
learners. Does the effect on test scores of cutting the student-teacher ratio depend 
on whether the percentage of students still learning English is high or low? One way 
to answer this question is to use a specification that allows for two different regres- 
sion lines, depending on whether there is a high or a low percentage of English learn- 
ers. This is achieved using the different intercept/different slope specification: 


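$$\widehat{\mathit{TestScore}} = \underset{(11.9)}{682.2} - \underset{(0.59)}{0.97}\,\mathit{STR} + \underset{(19.5)}{5.6}\,\mathit{HiEL} - \underset{(0.97)}{1.28}\,(\mathit{STR} \times \mathit{HiEL}), \quad \bar{R}^2 = 0.305, \tag{8.34}$$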


where the binary variable $\mathit{HiEL}_i$ equals 1 if the percentage of students still learning
English in the district is greater than 10% and equals 0 otherwise.

For districts with a low fraction of English learners ($\mathit{HiEL}_i = 0$), the estimated
regression line is $682.2 - 0.97\,\mathit{STR}_i$. For districts with a high fraction of English
learners ($\mathit{HiEL}_i = 1$), the estimated regression line is
$682.2 + 5.6 - 0.97\,\mathit{STR}_i - 1.28\,\mathit{STR}_i = 687.8 - 2.25\,\mathit{STR}_i$. According to these estimates, reducing
the student-teacher ratio by 1 is predicted to increase test scores by 0.97 points in
districts with low fractions of English learners but by 2.25 points in districts with high
fractions of English learners. The difference between these two effects, 1.28 points, is
the coefficient on the interaction term in Equation (8.34).

The interaction regression model in Equation (8.34) allows us to estimate the
effect of more nuanced policy interventions than the across-the-board class size
reduction considered so far. For example, suppose the state considered a policy to reduce
the student-teacher ratio by 2 in districts with a high fraction of English learners
($\mathit{HiEL}_i = 1$) but to leave class size unchanged in other districts. Applying the method
of Key Concept 8.1 to Equations (8.32) and (8.34) shows that the estimated effect of
this reduction for the districts for which $\mathit{HiEL} = 1$ is $-2(\hat{\beta}_1 + \hat{\beta}_3) = 4.50$. The
standard error of this estimated effect is $SE(-2\hat{\beta}_1 - 2\hat{\beta}_3) = 1.53$, which can be
computed using Equation (8.8) and the methods of Section 7.3.
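
The two estimated regression lines and the targeted policy effect follow directly from the coefficients in Equation (8.34). The sketch below is one way to organize the arithmetic; the helper names `line` and `effect_of_str_change` are hypothetical, and only the four coefficient values come from the text.

```python
# Coefficients as reported in Equation (8.34).
b0, b_str, b_hiel, b_inter = 682.2, -0.97, 5.6, -1.28


def line(hiel: int) -> tuple[float, float]:
    """Return (intercept, slope) of the estimated regression line for HiEL = 0 or 1."""
    return b0 + b_hiel * hiel, b_str + b_inter * hiel


def effect_of_str_change(delta_str: float, hiel: int) -> float:
    """Predicted change in test scores when STR changes by delta_str, holding HiEL fixed."""
    _, slope = line(hiel)
    return slope * delta_str


for hiel in (0, 1):
    intercept, slope = line(hiel)
    print(f"HiEL={hiel}: TestScore = {intercept:.1f} + ({slope:.2f}) * STR")

# Policy: cut STR by 2 only in districts with a high fraction of English learners.
print(f"Effect of cutting STR by 2 when HiEL=1: {effect_of_str_change(-2, 1):.2f}")
```

The sketch does not reproduce the standard error of 1.53: that calculation also needs the estimated covariance between $\hat{\beta}_1$ and $\hat{\beta}_3$, which Equation (8.34) does not report.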

The OLS regression in Equation (8.34) can be used to test several hypotheses
about the population regression line. First, the hypothesis that the two lines are, in
fact, the same can be tested by computing the F-statistic testing the joint hypothesis
that the coefficient on $\mathit{HiEL}_i$ and the coefficient on the interaction term $\mathit{STR}_i \times \mathit{HiEL}_i$
are both 0. This F-statistic is 89.9, which is significant at the 1% level.

Second, the hypothesis that the two lines have the same slope can be tested by
testing whether the coefficient on the interaction term is 0. The t-statistic,
$-1.28/0.97 = -1.32$, is less than 1.64 in absolute value, so the null hypothesis that
the two lines have the same slope cannot be rejected using a two-sided test at the
10% significance level.

Third, the hypothesis that the lines have the same intercept corresponds to the
restriction that the population coefficient on $\mathit{HiEL}$ is 0. The t-statistic testing this
restriction is $t = 5.6/19.5 = 0.29$, so the hypothesis that the lines have the same
intercept cannot be rejected at the 5% level.
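
The two individual tests above are simple ratios of a coefficient to its standard error, and can be checked with a few lines of Python. In this sketch the function names are illustrative; the 10% critical value of 1.64 is the one used in the text, and 1.96 is the standard large-sample two-sided 5% critical value for a normally distributed t-statistic.

```python
def t_statistic(estimate: float, std_error: float, null_value: float = 0.0) -> float:
    """t-statistic for the null hypothesis that the coefficient equals null_value."""
    return (estimate - null_value) / std_error


def rejects(t: float, critical_value: float) -> bool:
    """Two-sided test: reject when |t| exceeds the critical value."""
    return abs(t) > critical_value


CRIT_10_PERCENT = 1.64   # two-sided 10% critical value used in the text
CRIT_5_PERCENT = 1.96    # two-sided 5% large-sample normal critical value

# Estimates and standard errors from Equation (8.34).
t_interaction = t_statistic(-1.28, 0.97)   # same-slope hypothesis
t_hiel = t_statistic(5.6, 19.5)            # same-intercept hypothesis

print(f"t (interaction) = {t_interaction:.2f}, reject at 10%: {rejects(t_interaction, CRIT_10_PERCENT)}")
print(f"t (HiEL)        = {t_hiel:.2f}, reject at 5%:  {rejects(t_hiel, CRIT_5_PERCENT)}")
```

The joint F-statistic cannot be recovered the same way, because it depends on the covariance between the two coefficient estimates as well as on their individual standard errors.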

These three tests produce seemingly contradictory results: The joint test using
the F-statistic rejects the joint hypothesis that the slope and the intercept are the
same, but the tests of the individual hypotheses using the t-statistic fail to reject. The
reason is that the regressors, $\mathit{HiEL}$ and $\mathit{STR} \times \mathit{HiEL}$, are highly correlated. This
results in large standard errors on the individual coefficients. Even though it is
impossible to tell which of the coefficients is nonzero, there is strong evidence against the
hypothesis that both are 0.

Finally, the hypothesis that the student-teacher ratio does not enter this specifi- 
cation can be tested by computing the F-statistic for the joint hypothesis that the 
coefficients on STR and on the interaction term are both 0. This F-statistic is 5.64, 
which has a p-value of 0.004. Thus the coefficients on the student-teacher ratio are 
jointly statistically significant at the 1% significance level. 

The Effect of Ageing on Healthcare Expenditures: A Red Herring?


In Western Europe, the number of old people in
the total population is increasing on average, with
a greater proportion of the post-World War II “baby
boom” generation reaching retirement age.

This has led to researchers becoming increas- 
ingly interested in the impact of ageing on healthcare 
expenditures (HCE), which refers to the amount spent 
on improving people’s health and on health-related 
issues, in recent decades. Initial estimates published 
by the Organisation for Economic Co-operation and 
Development (OECD) painted a very pessimistic 
picture: because older people had, on average, higher 
HCE, an ageing population would place an associated 
upward pressure on public finances. 

Intuitively, this seems to make sense. However, other researchers noticed a problem
with this logic. If people age more healthily, what does this mean for HCE? A consensus
emerged in the academic literature that what determines HCE is not ageing per se, but
an individual’s proximity to death (“time-to-death” or TTD). In terms of these
expenditures, an 80 year old who dies at age 85 is more similar to a 70 year old who dies
at age 75, than to another 80 year old who dies at the age of 100. Under this logic,
ageing itself became termed a “red herring” in explaining HCE, that is, something that
acts as a proxy for their actual determinants. Time-to-death is regarded as an omitted
variable in previous regressions explaining HCE.

When carrying out regressions of healthcare expenditures, the dependent variable
employed is generally the logarithm of HCE, or a “log-transform” of HCE. An example
of such a regression is evident in a 2015 study that was conducted on two samples
of around 40,000 individuals each, from England, a) who used inpatient health care
during 2005-06 and died by 2011-12 and b) who had some hospital utilization since
2005-06 but died in 2011-12. Based on the data from this study, Table 8.1 presents the
results of a regression with the logarithm of HCE as the dependent variable for men
in England between 2005-06 and 2011-12 (Howdon and Rice, 2018).

How do we interpret this output? It is important to remember that our dependent
variable is not HCE, but their log transform, and that we are dealing with age and
age² as regressors. So using the coefficients from column (1), we compute the average
percentage increase in healthcare expenditures for ageing from 80 to 81 as
$1 \times (-0.01459) + (81^2 - 80^2) \times 0.00010 = 0.00151$, or
a 0.151% increase.
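
The calculation above can be written as a small function, which makes it easy to repeat for other ages. This is a sketch only: the function name `log_hce_change` is an illustrative choice, and the two coefficients are the column (1) estimates for age and age² from Table 8.1.

```python
# Column (1) coefficients on age and age squared from Table 8.1.
b_age, b_age_sq = -0.01459, 0.00010


def log_hce_change(age_from: float, age_to: float) -> float:
    """Predicted change in log(HCE) when age moves from age_from to age_to."""
    return b_age * (age_to - age_from) + b_age_sq * (age_to**2 - age_from**2)


change = log_hce_change(80, 81)
print(f"Change in log(HCE), age 80 to 81: {change:.5f}")
print(f"Approximate percentage change:    {100 * change:.3f}%")
```

Because the regression function is quadratic in age, the predicted change depends on the starting age: calling `log_hce_change(60, 61)` with the same coefficients gives a negative value.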

What happens when we include (the log of) TTD? We observe in column (2) that the
age and age² coefficients fall in absolute terms, there is a reduction in statistical
significance attached to these coefficients, and that log(TTD) is highly significant
in explaining log(HCE). This suggests that TTD is indeed an omitted variable
in this regression. Since both of these variables are log-transformed, our results
suggest that being 1% further away from death (a 1% increase in TTD) is associated
with an average decrease in HCE of around 0.42%.

But is this the end of the story? Further research has pointed to TTD itself as a
“red herring,” with TTD itself merely proxying for individual morbidity. Measures of
morbidity, under this logic, would be an omitted variable in such regressions, and this
would be important in predicting future HCE if people not only age more healthily,
but approach death more healthily! And this is exactly what we observe in column (3)
of Table 8.1: the inclusion of morbidity controls reduces both the size and statistical
significance of TTD and age-related coefficients, suggesting that TTD indeed proxies
for morbidity. It is important to remember that determining the relevant variables to
include in regression analysis depends on the exact nature of the question you are
trying to answer.

¹For further reading, see CHE Research Paper 107, “Health Care Expenditures, Age, Proximity to Death and Morbidity:
Implications for an Ageing Population,” 57 (Supplement C), 60-74, by Daniel Howdon and Nigel Rice.

TABLE 8.1  The Relationships Between Age, TTD and Morbidities, and HCE
Dependent variable: logarithm of healthcare expenditures.

Regressor      (1)            (2)            (3)
Age            -0.01459**     -0.01274*      -0.00518
               (0.00654)      (0.00652)      (0.00526)
Age²           0.00010**      0.00009**      0.00003
               (0.00004)      (0.00004)      (0.00003)
Log(TTD)                      -0.42375***    -0.14454***
                              (0.01467)      (0.01206)
Morbidities                                  Included (Jointly ***)

Key: *** Significant at 1% level, ** Significant at 5% level, * Significant at 10% level. Standard errors in parentheses.


Interactions Between Two Continuous Variables 


Now suppose that both independent variables ($X_{1i}$ and $X_{2i}$) are continuous. An
example is when $Y_i$ is log earnings of the $i$th worker, $X_{1i}$ is his or her years of work
experience, and $X_{2i}$ is the number of years he or she went to school. If the population
regression function is linear, the effect on wages of an additional year of experience
does not depend on the number of years of education, or, equivalently, the effect of
an additional year of education does not depend on the number of years of work
experience. In reality, however, there might be an interaction between these two
variables, so that the effect on wages of an additional year of experience depends
on the number of years of education. This interaction can be modeled by augmenting
the linear regression model with an interaction term that is the product of $X_{1i}$
and $X_{2i}$:


$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i}) + u_i. \tag{8.35}$$

The interaction term allows the effect of a unit change in $X_1$ to depend on $X_2$. To see
this, apply the general method for computing effects in nonlinear regression models
in Key Concept 8.1. The difference in Equation (8.4), computed for the interacted
regression function in Equation (8.35), is $\Delta Y = (\beta_1 + \beta_3 X_2)\Delta X_1$ [Exercise 8.10(a)].
Thus the effect on $Y$ of a change in $X_1$, holding $X_2$ constant, is

$$\frac{\Delta Y}{\Delta X_1} = \beta_1 + \beta_3 X_2, \tag{8.36}$$

which depends on $X_2$. For example, in the earnings example, if $\beta_3$ is positive, then the
effect on log earnings of an additional year of experience is greater, by the amount
$\beta_3$, for each additional year of education the worker has.

A similar calculation shows that the effect on $Y$ of a change $\Delta X_2$ in $X_2$, holding
$X_1$ constant, is $\Delta Y / \Delta X_2 = \beta_2 + \beta_3 X_1$.

Putting these two effects together shows that the coefficient $\beta_3$ on the interaction
term is the effect of a unit increase in $X_1$ and $X_2$ above and beyond the sum of the
effects of a unit increase in $X_1$ alone and a unit increase in $X_2$ alone. That is, if $X_1$
changes by $\Delta X_1$ and $X_2$ changes by $\Delta X_2$, then the expected change in $Y$ is
$\Delta Y = (\beta_1 + \beta_3 X_2)\Delta X_1 + (\beta_2 + \beta_3 X_1)\Delta X_2 + \beta_3 \Delta X_1 \Delta X_2$ [Exercise 8.10(c)]. The
first term is the effect from changing $X_1$ holding $X_2$ constant; the second term is
the effect from changing $X_2$ holding $X_1$ constant; and the final term, $\beta_3 \Delta X_1 \Delta X_2$, is the
extra effect from changing both $X_1$ and $X_2$.
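
The three-part decomposition of $\Delta Y$ can be verified numerically by comparing it with the direct difference of the regression function at the new and old values. The sketch below does this for one set of hypothetical coefficients and starting values; none of the numbers come from the text, and the function names are illustrative.

```python
def regression_function(x1: float, x2: float, b: tuple) -> float:
    """Interacted population regression function of Equation (8.35), without the error."""
    b0, b1, b2, b3 = b
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def decomposed_change(x1: float, x2: float, dx1: float, dx2: float, b: tuple) -> tuple:
    """Split the expected change in Y into its three terms."""
    _, b1, b2, b3 = b
    own_x1 = (b1 + b3 * x2) * dx1   # effect of changing X1, holding X2 constant
    own_x2 = (b2 + b3 * x1) * dx2   # effect of changing X2, holding X1 constant
    joint = b3 * dx1 * dx2          # extra effect of changing both together
    return own_x1, own_x2, joint


# Hypothetical coefficients and values, for illustration only.
b = (1.0, 0.05, 0.10, 0.002)
x1, x2, dx1, dx2 = 10.0, 12.0, 1.0, 1.0

direct = regression_function(x1 + dx1, x2 + dx2, b) - regression_function(x1, x2, b)
parts = decomposed_change(x1, x2, dx1, dx2, b)

print(f"Direct difference:  {direct:.4f}")
print(f"Sum of three terms: {sum(parts):.4f}  (terms: {parts})")
```

With unit changes in both variables, the third term equals $\beta_3$ itself, which is the "above and beyond" interpretation of the interaction coefficient.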

Interactions between two variables are summarized as Key Concept 8.5. 

When interactions are combined with logarithmic transformations, they can be 
used to estimate price elasticities when the price elasticity depends on the characteris- 
tics of the good (see the box “The Demand for Economics Journals” for an example). 


KEY CONCEPT 8.5
Interactions in Multiple Regression

The interaction term between the two independent variables $X_1$ and $X_2$ is their
product $X_1 \times X_2$. Including this interaction term allows the effect on $Y$ of a change
in $X_1$ to depend on the value of $X_2$ and, conversely, allows the effect of a change
in $X_2$ to depend on the value of $X_1$.

The coefficient on $X_1 \times X_2$ is the effect of a one-unit increase in $X_1$ and $X_2$
above and beyond the sum of the individual effects of a unit increase in $X_1$ alone and
a unit increase in $X_2$ alone. This is true whether $X_1$ and/or $X_2$ is continuous or binary.




The Demand for Economics Journals 


Professional economists follow the most recent research in their areas of specialization.
Most research in economics first appears in economics journals, so economists (or their
libraries) subscribe to economics journals.

How elastic is the demand by libraries for economics journals? To find out, we analyzed
the relationship between the number of subscriptions to a journal at U.S. libraries ($Y_i$)
and the journal’s library subscription price using data for the year 2000 for 180
economics journals. Because the product of a journal is the ideas it contains, its price is
logically measured not in dollars per year or dollars per page but instead in dollars per
idea. Although we cannot measure “ideas” directly, a good indirect measure is the number
of times that articles in a journal are subsequently cited by other researchers.
Accordingly, we measure price as the “price per citation” in the journal. The price
range is enormous, from 4¢ per citation (the American Economic Review) to 20¢ per
citation or more. Some journals are expensive per citation because they have few
citations and others because their library subscription price per year is very high. In
2017 a library print subscription to the Journal of Econometrics cost \$5363, compared to
only \$940 for a bundled subscription to all eight journals published by the American
Economic Association, including the American Economic Review!

Because we are interested in estimating elasticities, we use a log-log specification
(Key Concept 8.2). The scatterplots in Figures 8.9a and 8.9b provide empirical support
for this transformation. Because some of the oldest and most prestigious journals are
the cheapest per citation, a regression of log quantity against log price could have
omitted variable bias. Our regressions therefore include two control variables: the
logarithm of age and the logarithm of the number of characters per year in the journal.

FIGURE 8.9 Library Subscriptions and Prices of Economics Journals
[Figure: three scatterplots. (a) Subscriptions and price per citation. (b) ln(Subscriptions)
and ln(Price per citation). (c) ln(Subscriptions) and ln(Price per citation), with demand
curves superimposed for Age = 5 and Age = 80.]
There is a nonlinear inverse relation between the number of U.S. library subscriptions
(quantity) and the library price per citation (price), as shown in Figure 8.9a for 180
economics journals in 2000. But as seen in Figure 8.9b, the relation between log
quantity and log price appears to be approximately linear. Figure 8.9c shows that
demand is more elastic for young journals (Age = 5) than for old journals (Age = 80).

TABLE 8.2  Estimates of the Demand for Economics Journals
Dependent variable: logarithm of subscriptions at U.S. libraries in the year 2000; 180 observations.

Regressor                             (1)        (2)        (3)        (4)
ln(Price per citation)                -0.533     -0.408     -0.961     -0.899
                                      (0.034)    (0.044)    (0.160)    (0.145)
[ln(Price per citation)]²                                   0.017
                                                            (0.025)
[ln(Price per citation)]³                                   0.0037
                                                            (0.0055)
ln(Age)                                          0.424      0.373      0.374
                                                 (0.119)    (0.118)    (0.118)
ln(Age) × ln(Price per citation)                            0.156      0.141
                                                            (0.052)    (0.040)
ln(Characters ÷ 1,000,000)                       0.206      0.235      0.229
                                                 (0.098)    (0.098)    (0.096)

F-Statistics and Summary Statistics
F-statistic testing coefficients on                         0.25
quadratic and cubic terms (p-value)                         (0.779)
SER                                   0.750      0.705      0.691      0.688
R̄²                                    0.555      0.607      0.622      0.626

The F-statistic tests the hypothesis that the coefficients on [ln(Price per citation)]² and
[ln(Price per citation)]³ are both 0. All regressions include an intercept (not reported in
the table). Standard errors are given in parentheses under coefficients, and p-values are
given in parentheses under F-statistics.

The regression results are summarized in Table 8.2. 
Those results yield the following conclusions (see if 
you can find the basis for these conclusions in the 
table!): 


1. Demand is less elastic for older than for newer 
journals. 

2. The evidence supports a linear, rather than a 
cubic, function of log price. 


3. Demand is greater for journals with more
characters, holding price and age constant.




So what is the elasticity of demand for economics journals? It depends on the age of
the journal. Demand curves for an 80-year-old journal and a 5-year-old upstart are
superimposed on the scatterplot in Figure 8.9c; the older journal’s demand
elasticity is $-0.28$ ($SE = 0.06$), while the younger
journal’s is $-0.67$ ($SE = 0.08$).
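
In a log-log specification with an interaction between ln(Age) and ln(Price per citation), the price elasticity is the coefficient on ln(Price per citation) plus the interaction coefficient times ln(Age). The sketch below applies that formula using the regression (4) coefficients from Table 8.2. That the reported elasticities are based on regression (4) is an assumption made here, although the computed values do round to the two figures quoted above; the function name is illustrative.

```python
import math

# Regression (4) of Table 8.2: coefficient on ln(Price per citation) and on
# the interaction ln(Age) x ln(Price per citation). Using column (4) is an
# assumption of this sketch.
b_price, b_age_price = -0.899, 0.141


def price_elasticity(age_years: float) -> float:
    """Estimated price elasticity of demand for a journal of the given age."""
    return b_price + b_age_price * math.log(age_years)


for age in (5, 80):
    print(f"Age = {age:>2}: elasticity = {price_elasticity(age):.2f}")
```

The standard errors of these elasticities (0.06 and 0.08) cannot be obtained from the table alone, since they depend on the covariance between the two coefficient estimates.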

This demand is very inelastic: Demand is very insensitive to price, especially for
older journals. For libraries, having the most recent research on hand is a necessity,
not a luxury. By way of comparison, experts estimate the demand elasticity for
cigarettes to be in the range of $-0.3$ to $-0.5$. Economics journals are, it seems, as
addictive as cigarettes but a lot better for your health!¹

¹These data were graciously provided by Professor Theodore Bergstrom of the Department
of Economics at the University of California, Santa Barbara. If you are interested in
learning more about the economics of economics journals, see Bergstrom (2001).

Application to the student-teacher ratio and the percentage of English learners. The
previous examples considered interactions between the student-teacher ratio and a
binary variable indicating whether the percentage of English learners is large or
small. A different way to study this interaction is to examine the interaction between
the student-teacher ratio and the continuous variable, the percentage of English
learners ($\mathit{PctEL}$). The estimated interaction regression is


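$$\widehat{\mathit{TestScore}} = \underset{(11.8)}{686.3} - \underset{(0.59)}{1.12}\,\mathit{STR} - \underset{(0.37)}{0.67}\,\mathit{PctEL} + \underset{(0.019)}{0.0012}\,(\mathit{STR} \times \mathit{PctEL}), \quad \bar{R}^2 = 0.422. \tag{8.37}$$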


When the percentage of English learners is at the median ($\mathit{PctEL} = 8.85$), the
slope of the line relating test scores and the student-teacher ratio is estimated to
be $-1.11$ ($= -1.12 + 0.0012 \times 8.85$). When the percentage of English learners is
at the 75th percentile ($\mathit{PctEL} = 23.0$), this line is estimated to be slightly flatter,
with a slope of $-1.09$ ($= -1.12 + 0.0012 \times 23.0$). That is, for a district with 8.85%
English learners, the estimated effect of a one-unit reduction in the student-teacher
ratio is to increase test scores by 1.11 points, but for a district with 23.0% English
learners, reducing the student-teacher ratio by one unit is predicted to increase test
scores by only 1.09 points. The difference between these estimated effects is not
statistically significant, however: The t-statistic testing whether the coefficient on
the interaction term is 0 is $t = 0.0012/0.019 = 0.06$, which is not significant at the
10% level.
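
With a continuous interaction, the estimated slope with respect to the student-teacher ratio is a linear function of the percentage of English learners, as in Equation (8.36). The sketch below evaluates that slope at the two values of PctEL discussed above and recomputes the t-statistic on the interaction term. The function name `str_slope` is illustrative; the coefficients and standard error are those reported in Equation (8.37).

```python
# Coefficient on STR and on the interaction STR x PctEL from Equation (8.37),
# together with the standard error of the interaction coefficient.
b_str, b_inter, se_inter = -1.12, 0.0012, 0.019


def str_slope(pct_el: float) -> float:
    """Estimated effect on test scores of a one-unit increase in STR at a given PctEL."""
    return b_str + b_inter * pct_el


for label, pct_el in (("median", 8.85), ("75th percentile", 23.0)):
    print(f"PctEL = {pct_el:5.2f} ({label}): slope = {str_slope(pct_el):.2f}")

t_inter = b_inter / se_inter
print(f"t-statistic on interaction term: {t_inter:.2f}")
```

The two slopes differ by only about 0.02, which is consistent with the small and statistically insignificant interaction coefficient.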

To keep the discussion focused on nonlinear models, the specifications in 
Sections 8.1 through 8.3 exclude additional control variables such as the students’ 
economic background. Consequently, these results arguably are subject to omitted 
variable bias. To draw substantive conclusions about the effect on test scores of 
reducing the student-teacher ratio, these nonlinear specifications must be augmented 
with control variables, and it is to such an exercise that we now turn. 

8.4 Nonlinear Effects on Test Scores of the Student-Teacher Ratio


This section addresses three specific questions about test scores and the student- 
teacher ratio. First, after controlling for differences in economic characteristics of 
different districts, does the effect on test scores of reducing the student-teacher ratio 
depend on the fraction of English learners? Second, does this effect depend on the 
value of the student-teacher ratio? Third, and most important, after taking economic 
factors and nonlinearities into account, what is the estimated effect on test scores of 
reducing the student-teacher ratio by two students per teacher, as our superinten- 
dent from Chapter 4 proposes to do? 

We answer these questions by considering nonlinear regression specifications of 
the type discussed in Sections 8.2 and 8.3, extended to include two measures of the 
economic background of the students: the percentage of students eligible for a sub- 
sidized lunch and the logarithm of average district income. The logarithm of district 
income is used because the empirical analysis of Section 8.2 suggests that this speci- 
fication captures the nonlinear relationship between test scores and district income. 
As in Section 7.6, we do not include expenditures per pupil as a regressor, and in so 
doing, we are considering the effect of decreasing the student-teacher ratio, while 
allowing expenditures per pupil to increase (that is, we are not holding expenditures 
per pupil constant). 


Discussion of Regression Results 


The OLS regression results are summarized in Table 8.3. The columns labeled (1) 
through (7) each report separate regressions. The entries in the table are the coeffi- 
cients, standard errors, certain F-statistics and their p-values, and summary statistics, 
as indicated by the description in each row. In addition, the middle block presents 
95% confidence intervals for the estimated effect of reducing the class size by two, 
the question asked by the superintendent. Because some of the specifications are 
nonlinear, the confidence intervals are worked out for various cases, including reduc- 
ing the size of a larger class (22 to 20) or of a moderately-sized class (20 to 18), and 
for the case of high or low fractions of English learners, where the specific cases 
depend on the specifications. 
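
For the linear specifications, the confidence interval for the effect of reducing the class size by two follows directly from the coefficient and its standard error. The short Python sketch below illustrates the arithmetic using the rounded values reported in columns (1) and (2) of Table 8.3. The function name `effect_and_ci` and the use of the 1.96 normal critical value are our own choices for illustration, not part of the original analysis.

```python
# Effect of changing STR by delta_x in a linear specification, with a
# 95% confidence interval. Inputs are the rounded coefficients and
# standard errors from columns (1) and (2) of Table 8.3.

def effect_and_ci(beta, se, delta_x=-2.0, z=1.96):
    """Return (effect, lower, upper) for a change of delta_x in the regressor."""
    effect = beta * delta_x
    # The standard error of beta * delta_x is |delta_x| * SE(beta).
    half_width = z * abs(delta_x) * se
    return effect, effect - half_width, effect + half_width


for label, beta, se in [("(1)", -1.00, 0.27), ("(2)", -0.73, 0.26)]:
    effect, lower, upper = effect_and_ci(beta, se)
    print(f"Regression {label}: effect = {effect:.2f}, "
          f"95% CI = [{lower:.2f}, {upper:.2f}]")
```

Because the inputs are rounded to two decimals, the endpoints computed this way can differ slightly in the second decimal from the intervals reported in the table.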

The first column of regression results, labeled regression (1) in the table, is
regression (3) in Table 7.1, repeated here for convenience. This regression does not
control for district income, so the first thing we do is check whether the results change
substantially when log income is included as an additional economic control variable.
The results are given in regression (2) in Table 8.3. The log of income is statistically
significant at the 1% level, and the coefficient on the student-teacher ratio becomes
somewhat closer to 0, changing from -1.00 to -0.73, although it remains statistically
significant at the 1% level. The change in the coefficient on STR is large enough
between regressions (1) and (2) to warrant additionally controlling for the logarithm
of income in the remaining regressions as a deterrent to omitted variable bias.

TABLE 8.3 Nonlinear Regression Models of Test Scores

Dependent variable: average test score in district; 420 observations.

Regressor                            (1)            (2)            (3)            (4)            (5)            (6)            (7)
Student-teacher ratio (STR)          -1.00          -0.73          -0.97          -0.53          64.33          83.70          65.29
                                     (0.27)         (0.26)         (0.59)         (0.34)         (24.86)        (28.50)        (25.26)
STR^2                                                                                            -3.42          -4.38          -3.47
                                                                                                 (1.25)         (1.44)         (1.27)
STR^3                                                                                            0.059          0.075          0.060
                                                                                                 (0.021)        (0.024)        (0.021)
% English learners                   -0.122         -0.176                                                                     -0.166
                                     (0.033)        (0.034)                                                                    (0.034)
% English learners ≥ 10%?                                          5.64           5.50           -5.47          816.1
(Binary, HiEL)                                                     (19.51)        (9.80)         (1.03)         (327.7)
HiEL × STR                                                         -1.28          -0.58                         -123.3
                                                                   (0.97)         (0.50)                        (50.2)
HiEL × STR^2                                                                                                    6.12
                                                                                                                (2.54)
HiEL × STR^3                                                                                                    -0.101
                                                                                                                (0.043)

Included Economic Control Variables
% eligible for subsidized lunch      Y              Y              N              Y              Y              Y              Y
Average district income (logarithm)  N              Y              N              Y              Y              Y              Y

95% Confidence Intervals for the Effect of Reducing STR by 2
No HiEL interaction                  [0.93, 3.06]   [0.46, 2.48]
  22 to 20                                                                                       [0.61, 3.25]                  [0.54, 3.26]
  20 to 18                                                                                       [1.64, 4.36]                  [1.55, 4.30]
HiEL = 0                                                           [-0.38, 4.25]  [-0.28, 2.41]
  22 to 20                                                                                                      [0.40, 3.98]
  20 to 18                                                                                                      [1.22, 4.99]
HiEL = 1                                                           [1.48, 7.50]   [0.80, 3.63]
  22 to 20                                                                                                      [-0.98, 2.91]
  20 to 18                                                                                                      [-0.72, 4.01]

F-Statistics and p-Values on Joint Hypotheses
All STR variables                                                  5.64           5.92           6.31           4.96           5.91
and interactions = 0                                               (0.004)        (0.003)        (<0.001)       (<0.001)       (0.001)
STR^2, STR^3 = 0                                                                                 6.17           5.81           5.96
                                                                                                 (<0.001)       (0.003)        (0.003)
HiEL × STR, HiEL × STR^2,                                                                                       2.69
HiEL × STR^3 = 0                                                                                                (0.046)
SER                                  9.08           8.64           15.88          8.63           8.56           8.55           8.57
Adjusted R^2                         0.773          0.794          0.305          0.795          0.798          0.799          0.798

These regressions were estimated using the data on K-8 school districts in California, described in Appendix 4.1. Regressions
include an intercept and the economic control variables indicated by “Y” or exclude them if indicated by “N” (coefficients
not shown in the table). Standard errors are given in parentheses under coefficients, and p-values are given in parentheses
under F-statistics.

Regression (3) in Table 8.3 is the interacted regression in Equation (8.34) with 
the binary variable for a high or low percentage of English learners but with no eco- 
nomic control variables. When the economic control variables (percentage eligible 
for subsidized lunch and log income) are added [regression (4) in the table], the class 
size effect is reduced for both high and low English learner classes; however, the 
confidence intervals are wide in both cases in both regressions. Based on the evi- 
dence in regression (4), the hypothesis that the effect of STR is the same for districts 
with low and high percentages of English learners cannot be rejected at the 5% level 
(the t-statistic is t = -0.58/0.50 = -1.16). 
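
The test in regression (4) can be checked with a few lines of Python. This is a minimal sketch that assumes the large-sample normal approximation for the t-statistic; the helper names `t_statistic` and `two_sided_p_value` are hypothetical and introduced only for this illustration.

```python
import math

# Coefficient and standard error on HiEL x STR from regression (4), Table 8.3.
ESTIMATE = -0.58
STD_ERROR = 0.50


def t_statistic(estimate, std_error, null_value=0.0):
    """t-statistic for the null hypothesis that the coefficient equals null_value."""
    return (estimate - null_value) / std_error


def two_sided_p_value(t):
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))


t = t_statistic(ESTIMATE, STD_ERROR)
p = two_sided_p_value(t)
print(f"t = {t:.2f}, two-sided p-value = {p:.3f}")
print("Reject at the 5% level:", abs(t) > 1.96)
print("Reject at the 10% level:", abs(t) > 1.645)
```

Since |t| = 1.16 is below both the 5% critical value (1.96) and the 10% critical value (1.645), the hypothesis of equal effects is not rejected at either level.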

Regression (5) examines whether the effect of changing the student-teacher
ratio depends on the value of the student-teacher ratio by including a cubic specification
in STR, controlling for the economic variables in regression (4) [the interaction
term, HiEL × STR, is not included in regression (5) because it was not significant in
regression (4) at the 10% level]. The estimates in regression (5) are consistent with
the student-teacher ratio having a nonlinear effect. The null hypothesis that the
relationship is linear is rejected at the 1% significance level against the alternative that
it is a polynomial up to degree 3 (the F-statistic testing the hypothesis that the true
coefficients on $STR^2$ and $STR^3$ are 0 is 6.17, with a p-value of < 0.001). The effect of
reducing the class size from 20 to 18 is estimated to be greater than if it is reduced
from 22 to 20. 

Regression (6) further examines whether the effect of the student-teacher ratio
depends not just on the value of the student-teacher ratio but also on the fraction of
English learners. By including interactions between HiEL and STR, $STR^2$, and $STR^3$,
we can check whether the (possibly cubic) population regression functions relating
test scores and STR are different for low and high percentages of English learners.
To do so, we test the restriction that the coefficients on the three interaction terms
are 0. The resulting F-statistic is 2.69, which has a p-value of 0.046 and thus is
significant at the 5% but not at the 1% significance level. This provides tentative evidence
that the regression functions are different for districts with high and low percentages
of English learners; however, comparing regressions (6) and (4) makes it clear that
these differences are associated with the quadratic and cubic terms. Moreover, the
confidence intervals are quite wide in all cases for regression (6).

FIGURE 8.10 Three Regression Functions Relating Test Scores and Student-Teacher Ratio
The cubic regressions from columns (5) and (7) of Table 8.3 are nearly identical. They
indicate a small amount of nonlinearity in the relation between test scores and
student-teacher ratio.


Regression (7) is a modification of regression (5), in which the continuous variable
PctEL is used instead of the binary variable HiEL to control for the percentage
of English learners in the district. The coefficients on the other regressors do not
change substantially when this modification is made, indicating that the results in
regression (5) are not sensitive to what measure of the percentage of English learners
is actually used in the regression.

In all the specifications, the hypothesis that the student-teacher ratio does not
enter the regressions is rejected at the 1% level.

The nonlinear specifications in Table 8.3 are most easily interpreted graphically.
Figure 8.10 graphs the estimated regression functions relating test scores and the
student-teacher ratio for the linear specification (2) and the cubic specifications (5)
and (7), along with a scatterplot of the data.* These estimated regression functions
show the predicted value of test scores as a function of the student-teacher ratio,
holding fixed other values of the independent variables in the regression. The
estimated regression functions are all close to one another, although the cubic
regressions flatten out for large values of the student-teacher ratio.


Regression (6) suggests that the cubic regression functions relating test scores
and STR might depend on whether the percentage of English learners in the district
is large or small. Figure 8.11 graphs these two estimated regression functions so that
we can see whether this difference, in addition to being statistically significant, is of
practical importance. As Figure 8.11 shows, for student-teacher ratios between 17
and 23 (a range that includes 88% of the observations), the two functions are
separated by approximately 10 points but otherwise are very similar; that is, for STR
between 17 and 23, districts with a lower percentage of English learners do better,
holding constant the student-teacher ratio, but the effect of a change in the
student-teacher ratio is essentially the same for the two groups. The two regression functions
are different for student-teacher ratios below 16.5, but we must be careful not to read
more into this than is justified. The districts with STR < 16.5 constitute only 6% of
the observations, so the differences between the nonlinear regression functions are
reflecting differences in these very few districts with very low student-teacher ratios.
Thus, based on Figure 8.11, we conclude that the effect on test scores of a change in
the student-teacher ratio does not depend on the percentage of English learners for
the range of student-teacher ratios for which we have the most data.

FIGURE 8.11 Regression Functions for Districts with High and Low Percentages of English Learners
Districts with low percentages of English learners (HiEL = 0) are shown by gray dots,
and districts with HiEL = 1 are shown by colored dots. The cubic regression function
for HiEL = 1 from regression (6) in Table 8.3 is approximately 10 points below the cubic
regression function for HiEL = 0 for 17 ≤ STR ≤ 23, but otherwise the two functions
have similar shapes and slopes in this range. The slopes of the regression functions
differ most for very large and small values of STR, for which there are few observations.
(Vertical axis: test score.)

* For each curve, the predicted value was computed by setting each independent variable, other than STR,
to its sample average value and computing the predicted value by multiplying these fixed values of the
independent variables by the respective estimated coefficients from Table 8.3. This was done for various
values of STR, and the graph of the resulting adjusted predicted values is the estimated regression function
relating test scores and the STR, holding the other variables constant at their sample averages.


Summary of Findings 


These results let us answer the three questions raised at the start of this section. 

First, after controlling for economic background, there is at most weak evidence 
that the effect of a class size reduction depends on whether there are many or few 
English learners in the district. While a class size reduction is estimated to be more 
effective in districts with a high fraction of English learners, the difference in effects 
between high and low English learner districts is imprecisely estimated. Moreover, 
as shown in Figure 8.11, the estimated regression functions have similar slopes in the 
range of student-teacher ratios containing most of the data. 

Second, after controlling for economic background, there is evidence of a
nonlinear effect on test scores of the student-teacher ratio. The nonlinear estimates
suggest that the effect of reducing the student-teacher ratio is greatest in moderately
sized classes and is less for very small or very large classes. The null hypothesis of
linearity can be rejected at the 1% level. 

Third, we now can return to the superintendent’s problem that opened Chapter 4. 
She wants to know the effect on test scores of reducing the student-teacher ratio by 
two students per teacher. In the linear specification (2), this effect does not depend 
on the student-teacher ratio itself, and the estimated effect of this reduction is to 
improve test scores by 1.46 (= -0.73 × -2) points. In the nonlinear specifications, 
this effect depends on the value of the student-teacher ratio. If her district currently 
has a student-teacher ratio of 20 and she is considering cutting it to 18, then based 
on regression (5), the estimated effect of this reduction is to improve test scores by 
3.00 points, with a 95% confidence interval of (1.64, 4.36). If her district currently has 
a student-teacher ratio of 22 and she is considering cutting it to 20, then based on 
regression (5), the estimated effect of this reduction is to improve test scores by 1.93 
points, with a 95% confidence interval of (0.61, 3.25). [Similar results obtain from 
regression (7).] These estimates from the nonlinear specifications thus allow a more 
nuanced answer to her question, based on the characteristics of her district. 
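
The calculation behind these estimates follows Key Concept 8.1: evaluate the estimated regression function at the two values of STR and take the difference, holding the other regressors fixed. The sketch below does this in Python with the rounded coefficients printed in Table 8.3; the function name `predicted_change` is our own. Because the table rounds the coefficients (the cubic coefficient in particular), the nonlinear results from this sketch come out near 3.35 and 2.39 rather than the 3.00 and 1.93 reported above, which presumably rest on unrounded estimates. The qualitative conclusion, that the gain is larger when moving from 20 to 18 than from 22 to 20, is unchanged.

```python
# Predicted change in test scores when STR moves from x_from to x_to,
# holding the other regressors fixed (so their terms cancel in the difference).

def predicted_change(coefs, x_from, x_to):
    """coefs[k-1] is the coefficient on STR**k, for k = 1, 2, 3, ..."""
    return sum(b * (x_to ** k - x_from ** k)
               for k, b in enumerate(coefs, start=1))


LINEAR_2 = [-0.73]                 # regression (2), rounded as in Table 8.3
CUBIC_5 = [64.33, -3.42, 0.059]    # regression (5), rounded as in Table 8.3

print(f"Linear (2), 20 to 18: {predicted_change(LINEAR_2, 20, 18):.2f}")
print(f"Cubic (5),  20 to 18: {predicted_change(CUBIC_5, 20, 18):.2f}")
print(f"Cubic (5),  22 to 20: {predicted_change(CUBIC_5, 22, 20):.2f}")
```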


Conclusion 


This chapter presented several ways to model nonlinear regression functions. Because 
these models are variants of the multiple regression model, the unknown coefficients 
can be estimated by OLS, and hypotheses about their values can be tested using t-
and F-statistics as described in Chapter 7. In these models, the expected effect on $Y$
of a change in one of the independent variables, $X_1$, holding the other independent
variables $X_2, \ldots, X_k$ constant, in general, depends on the values of $X_1, X_2, \ldots, X_k$.

There are many different models in this chapter, and you could not be blamed for 
being a bit bewildered about which to use in a given application. How should you analyze 
possible nonlinearities in practice? Section 8.1 laid out a general approach for such an 
analysis, but this approach requires you to make decisions and exercise judgment along 
the way. It would be convenient if there were a single recipe you could follow that would 
always work in every application, but in practice data analysis is rarely that simple. 

The single most important step in specifying nonlinear regression functions is to “use 
your head.” Before you look at the data, can you think of a reason, based on economic 
theory or expert judgment, why the slope of the population regression function might 
depend on the value of that, or another, independent variable? If so, what sort of depen- 
dence might you expect? And, most important, which nonlinearities (if any) could have 
major implications for the substantive issues addressed by your study? Answering these 
questions carefully will focus your analysis. In the test score application, for example, such 
reasoning led us to investigate whether hiring more teachers might have a greater effect 
in districts with a large percentage of students still learning English, perhaps because 
those students would differentially benefit from more personal attention. By making the 
question precise, we were able to find a precise answer: After controlling for the eco- 
nomic background of the students, the estimated effect of reducing class size effectively 
does not depend on whether there are many or few English learners in the class. 


Summary 


1. In a nonlinear regression, the slope of the population regression function 
depends on the value of one or more of the independent variables. 

2. The effect on Y of a change in the independent variable(s) can be computed by 
evaluating the regression function at two values of the independent variable(s). 
The procedure is summarized in Key Concept 8.1. 

3. A polynomial regression includes powers of $X$ as regressors. A quadratic
regression includes $X$ and $X^2$, and a cubic regression includes $X$, $X^2$, and $X^3$.

4. Small changes in logarithms can be interpreted as proportional or percentage 
changes in a variable. Regressions involving logarithms are used to estimate 
proportional changes and elasticities. 

5. The product of two variables is called an interaction term. When interaction 
terms are included as regressors, they allow the regression slope of one variable 
to depend on the value of another variable. 
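
Point 4 can be illustrated numerically. The sketch below compares the exact percentage change with the logarithmic approximation for a few hypothetical values chosen only for illustration (they are not taken from the chapter's data); the function names are likewise our own.

```python
import math


def exact_pct_change(old, new):
    """Exact percentage change from old to new."""
    return 100.0 * (new - old) / old


def log_pct_change(old, new):
    """Approximate percentage change: 100 times the difference in logarithms."""
    return 100.0 * (math.log(new) - math.log(old))


BASE = 100.0  # hypothetical starting value
for new in (101.0, 110.0, 150.0):
    exact = exact_pct_change(BASE, new)
    approx = log_pct_change(BASE, new)
    print(f"{BASE:.0f} -> {new:.0f}: exact = {exact:.2f}%, "
          f"log approximation = {approx:.2f}%, gap = {exact - approx:.2f}")
```

The gap between the two measures is negligible for a 1% change and grows as the change becomes larger, which is why the logarithmic interpretation is described as applying to small changes.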


Key Terms 

quadratic regression model (280)
nonlinear regression function (282)
polynomial regression model (286)
cubic regression model (287)
elasticity (289)
exponential function (289)
natural logarithm (289)
linear-log model (290)
log-linear model (291)
log-log model (293)
interaction term (298)
interacted regressor (298)
interaction regression model (298)
nonlinear least squares (327)
nonlinear least squares estimators (327)


Review the Concepts 


8.1 A researcher states that there are nonlinearities in the relationship between
wages and years of schooling. What does this mean? How would you test for
nonlinearities in the relationship between wages and schooling? How would
you estimate the rate of change of wages with respect to years of schooling?

8.2 A Cobb-Douglas production function relates production ($Q$) to factors of
production, namely capital ($K$), labor ($L$), and raw materials ($M$), and an error
term $u$ using the equation $Q = A K^{\beta_1} L^{\beta_2} M^{\beta_3} e^{u}$, where $A$, $\beta_1$, $\beta_2$, and $\beta_3$ are
production parameters. Suppose you have data on production and the factors
of production from a random sample of firms with the same Cobb-Douglas
production function. How would you use regression analysis to estimate the
production parameters?

8.3 How is the slope coefficient interpreted in a log-linear model, where the
dependent variable is in logarithms but the independent variable is not? In a
linear-log model? In a log-log model?

8.4 Suppose the regression in Equation (8.30) is estimated using LoSTR and
LoEL in place of HiSTR and HiEL, where LoSTR = 1 - HiSTR is an
indicator for a low-class-size district and LoEL = 1 - HiEL is an indicator for
a district with a low percentage of English learners. What are the values of the
estimated regression coefficients?

8.5 Suppose that in Exercise 8.2 you thought that the value of $\beta_2$ was not constant
but rather increased when $K$ increased. How could you use an interaction
term to capture this effect?

8.6 What types of independent variables (binary or continuous) might interact
with one another in a regression? Explain how you would interpret the
coefficient on the interaction between two continuous regressors and between two
binary regressors.


Exercises 


8.1 Sales in a company are $243 million in 2018 and increase to $250 million in 2019.

a. Compute the percentage increase in sales, using the usual formula
$$100 \times \frac{Sales_{2019} - Sales_{2018}}{Sales_{2018}}.$$
Compare this value to the approximation
$$100 \times [\ln(Sales_{2019}) - \ln(Sales_{2018})].$$

b. Repeat (a), assuming that $Sales_{2019} = 255$, $Sales_{2019} = 260$, and
$Sales_{2019} = 265$.

c. How good is the approximation when the change is small? Does the
quality of the approximation deteriorate as the percentage change
increases?

8.2 Suppose a researcher collects data on houses that have sold in a particular 


neighborhood over the past year and obtains the regression results in the
following table.

Regression Results for Exercise 8.2

Dependent variable: ln(Price)

Regressor       (1)           (2)           (3)           (4)           (5)
Size            0.00042
                (0.000038)
ln(Size)                      0.69          0.68          0.57          0.69
                              (0.054)       (0.087)       (2.03)        (0.055)
[ln(Size)]^2                                              0.0078
                                                          (0.14)
Bedrooms                                    0.0036
                                            (0.037)
Pool            0.082         0.071         0.071         0.071         0.071
                (0.032)       (0.034)       (0.034)       (0.036)       (0.035)
View            0.037         0.027         0.026         0.027         0.027
                (0.029)       (0.028)       (0.026)       (0.029)       (0.030)
Pool × View                                                             0.0022
                                                                        (0.10)
Condition       0.13          0.12          0.12          0.12          0.12
                (0.045)       (0.035)       (0.035)       (0.036)       (0.035)
Intercept       10.97         6.60          6.63          7.02          6.60
                (0.069)       (0.39)        (0.53)        (7.50)        (0.40)

Summary Statistics
SER             0.1026        1.023                                     1.020
Adjusted R^2    0.0710        0.0761                                    0.0814

Variable definitions: Price = sale price ($); Size = house size (in square feet); Bedrooms = number of bedrooms; Pool =
binary variable (1 if house has a swimming pool, 0 otherwise); View = binary variable (1 if house has a nice view, 0
otherwise); Condition = binary variable (1 if real estate agent reports house is in excellent condition, 0 otherwise).

a. Using the results in column (1), what is the expected change in price of
building a 1500-square-foot addition to a house? Construct a 99%
confidence interval for the percentage change in price.

b. How is the coefficient on ln(Size) interpreted in column (2)? What is the
effect of a doubling of the size of a house on its price?

c. Using column (2), what is the estimated effect of view on price?
Construct a 99% confidence interval for this effect. Is the effect statistically
different from 0?

d. Using the results from the regression in column (3), calculate the effect
of adding two bedrooms to a house. Is the effect statistically significant?
Which of the two variables, size or number of bedrooms, do you think
is relatively more important in determining the price of a house?

e. Is the coefficient on condition significant in column (4)?

f. Is the interaction term between Pool and View statistically significant in
column (5)? Find the effect of adding a view on the price of a house with
a pool, as well as a house without a pool.

8.3 After reading this chapter’s analysis of test scores and class size, an
educator comments, “In my experience, student performance depends on class
size, but not in the way your regressions say. Rather, students do well when
class size is less than 20 students and do very poorly when class size is
greater than 25. There are no gains from reducing class size below 20
students, the relationship is constant in the intermediate region between 20
and 25 students, and there is no loss to increasing class size when it is
already greater than 25.” The educator is describing a threshold effect, in
which performance is constant for class sizes less than 20, jumps and is
constant for class sizes between 20 and 25, and then jumps again for class
sizes greater than 25. To model these threshold effects, define the binary
variables

STRsmall = 1 if STR < 20, and STRsmall = 0 otherwise;
STRmoderate = 1 if 20 ≤ STR ≤ 25, and STRmoderate = 0 otherwise; and
STRlarge = 1 if STR > 25, and STRlarge = 0 otherwise.

a. Consider the regression $TestScore_i = \beta_0 + \beta_1 STRsmall_i + \beta_2 STRlarge_i + u_i$.
Sketch the regression function relating TestScore to STR for hypothetical
values of the regression coefficients that are consistent with the educator’s
statement.

b. A researcher tries to estimate the regression
$TestScore_i = \beta_0 + \beta_1 STRsmall_i + \beta_2 STRmoderate_i + \beta_3 STRlarge_i + u_i$
and finds that the software gives an error message. Why?


8.4 Read the box “The Effect of Ageing on Healthcare Expenditures: A Red 
Herring?” in Section 8.3. 


a. Consider a male aged 60 years. Use the results from column (1) of
Table 8.1 and the method in Key Concept 8.1 to estimate the expected
change in the logarithm of health care expenditures (HCE) associated
with an additional year of age.

b. Repeat (a), assuming a man aged 70 years.

c. Explain why the answers to (a) and (b) are different.

d. Is the difference in the answers to (a) and (b) statistically significant at
the 5% level? Explain.


e. How would you change the regression if you suspected that the effect of 
age on HCE was different for men than for women? 


8.5 Read the box “The Demand for Economics Journals” in Section 8.3. 


a. The box reaches three conclusions. Looking at the results in the table, 
what is the basis for each of these conclusions? 


b. Using the results in regression (4), the box reports that the elasticity of 
demand for an 80-year-old journal is —0.28. 


i. How was this value determined from the estimated regression? 


ii. The box reports that the standard error for the estimated elasticity 
is 0.06. How would you calculate this standard error? (Hint: See the 
discussion in “Standard errors of estimated effects” on page 284.) 


c. Suppose the variable Characters had been divided by 1000 instead of 
1,000,000. How would the results in column (4) change? 


8.6 Refer to Table 8.3. 


a. A researcher suspects that the effect of % Eligible for subsidized lunch 
has a nonlinear effect on test scores. In particular, he conjectures that 
increases in this variable from 10% to 20% have little effect on test 
scores but that changes from 50% to 60% have a much larger effect. 


i. Describe a nonlinear specification that can be used to model this 
form of nonlinearity. 


ii. How would you test whether the researcher’s conjecture was better 
than the linear specification in column (7) of Table 8.3? 


b. A researcher suspects that the effect of income on test scores is different 
in districts with small classes than in districts with large classes. 


i. Describe a nonlinear specification that can be used to model this 
form of nonlinearity. 


ii. How would you test whether the researcher’s conjecture was better 
than the linear specification in column (7) of Table 8.3? 


8.7 This problem is inspired by a study of the gender gap in earnings in top cor- 
porate jobs (Bertrand and Hallock, 2001). The study compares total com- 
pensation among top executives in a large set of U.S. public corporations in 
the 1990s. (Each year these publicly traded corporations must report total 
compensation levels for their top five executives.)

a. Let Female be an indicator variable that is equal to 1 for females and 0
for males. A regression of the logarithm of earnings on Female yields

$$\widehat{\ln(Earnings)} = \underset{(0.01)}{6.48} - \underset{(0.05)}{0.44}\,Female, \quad SER = 2.65.$$

i. The estimated coefficient on Female is -0.44. Explain what this value
means.
ii. The SER is 2.65. Explain what this value means.
iii. Does this regression suggest that female top executives earn
less than top male executives? Explain.
iv. Does this regression suggest that there is sex discrimination? Explain.

b. Two new variables, the market value of the firm (a measure of firm size,
in millions of dollars) and stock return (a measure of firm performance,
in percentage points), are added to the regression:

$$\widehat{\ln(Earnings)} = \underset{(0.03)}{3.86} - \underset{(0.04)}{0.28}\,Female + \underset{(0.004)}{0.37}\,\ln(MarketValue) + \underset{(0.003)}{0.004}\,Return,$$
$$n = 46{,}670, \quad \bar{R}^2 = 0.345.$$

i. The coefficient on ln(MarketValue) is 0.37. Explain what this value
means.
ii. The coefficient on Female is now -0.28. Explain why it has changed
from the regression in (a).

c. Are large firms more likely than small firms to have female top
executives? Explain.
8.8 $X$ is a continuous variable that takes on values between 5 and 100. $Z$ is a
binary variable. Sketch the following regression functions (with values of $X$
between 5 and 100 on the horizontal axis and values of $Y$ on the vertical axis):

a. $Y = 2.0 + 3.0 \times \ln(X)$.

b. $Y = 2.0 - 3.0 \times \ln(X)$.

c. i. $Y = 2.0 + 3.0 \times \ln(X) + 4.0Z$, with $Z = 1$.
   ii. Same as (i), but with $Z = 0$.

d. i. $Y = 2.0 + 3.0 \times \ln(X) + 4.0Z - 1.0 \times Z \times \ln(X)$, with $Z = 1$.
   ii. Same as (i), but with $Z = 0$.

e. $Y = 1.0 + 125.0X - 0.01X^2$.


8.9 Explain how you would use approach 2 from Section 7.3 to calculate the
confidence interval discussed below Equation (8.8). [Hint: This requires estimating
a new regression using a different definition of the regressors and the
dependent variable. See Exercise (7.9).]

8.10 Consider the regression model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i}) + u_i$.
Use Key Concept 8.1 to show that

a. $\Delta Y / \Delta X_1 = \beta_1 + \beta_3 X_2$ (effect of change in $X_1$, holding $X_2$ constant).

b. $\Delta Y / \Delta X_2 = \beta_2 + \beta_3 X_1$ (effect of change in $X_2$, holding $X_1$ constant).

c. If $X_1$ changes by $\Delta X_1$ and $X_2$ changes by $\Delta X_2$, then
$\Delta Y = (\beta_1 + \beta_3 X_2)\Delta X_1 + (\beta_2 + \beta_3 X_1)\Delta X_2 + \beta_3 \Delta X_1 \Delta X_2$.

8.11 Derive the expressions for the elasticities given in Appendix 8.2 for the linear
and log-log models. (Hint: For the log-log model, assume that $u$ and $X$ are
independent, as is done in Appendix 8.2 for the log-linear model.)

8.12 The discussion following Equation (8.28) interprets the coefficient on
interacted binary variables using the conditional mean zero assumption. This
exercise shows that this interpretation also applies under conditional mean
independence. Consider the hypothetical experiment in Exercise 7.11.

a. Suppose you estimate the regression $Y_i = \gamma_0 + \gamma_1 X_{1i} + u_i$ using only
the data on returning students. Show that $\gamma_1$ is the class size effect
for returning students; that is, that
$\gamma_1 = E(Y_i \mid X_{1i} = 1, X_{2i} = 0) - E(Y_i \mid X_{1i} = 0, X_{2i} = 0)$.
Explain why $\hat{\gamma}_1$ is an unbiased estimator of $\gamma_1$.

b. Suppose you estimate the regression $Y_i = \delta_0 + \delta_1 X_{1i} + u_i$ using only
the data on new students. Show that $\delta_1$ is the class size effect for new
students; that is, that
$\delta_1 = E(Y_i \mid X_{1i} = 1, X_{2i} = 1) - E(Y_i \mid X_{1i} = 0, X_{2i} = 1)$.
Explain why $\hat{\delta}_1$ is an unbiased estimator of $\delta_1$.

c. Consider the regression for both returning and new students,
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i}) + u_i$. Use the conditional
mean independence assumption $E(u_i \mid X_{1i}, X_{2i}) = E(u_i \mid X_{2i})$ to show
that $\beta_1 = \gamma_1$, $\beta_1 + \beta_3 = \delta_1$, and $\beta_3 = \delta_1 - \gamma_1$ (the difference in the class
size effects).

d. Suppose you estimate the interaction regression in (c) using the
combined data and $E(u_i \mid X_{1i}, X_{2i}) = E(u_i \mid X_{2i})$. Show that $\hat{\beta}_1$ and $\hat{\beta}_3$ are
unbiased but that $\hat{\beta}_2$ is, in general, biased.


Empirical Exercises 


E8.1 


Lead is toxic, particularly for young children, and for this reason, government 
regulations severely restrict the amount of lead in our environment. But this 
was not always the case. In the early part of the 20th century, the under- 
ground water pipes in many U.S. cities contained lead, and lead from these 
pipes leached into drinking water. In this exercise, you will investigate the 
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effect of these lead water pipes on infant mortality. On the text website 
http://www.pearsonglobaleditions.com, you will find the data file Lead_ 
Mortality, which contains data on infant mortality, type of water pipes (lead or 


nonlead), water acidity (pH), and several demographic variables for 172 U.S. 


cities in 1900.° A detailed description is given in Lead_Mortality_Description, 


also available on the website. 


a. Compute the average infant mortality rate (Inf) for cities with lead pipes 


and for cities with nonlead pipes. Is there a statistically significant differ- 


ence in the averages? 


b. The amount of lead leached from lead pipes depends on the chemistry of 


the water running through the pipes. The more acidic the water is (that 


is, the lower its pH), the more lead is leached. Run a regression of Inf on 


Lead, pH, and the interaction term Lead X pH. 


i. The regression includes four coefficients (the intercept and the three
coefficients multiplying the regressors). Explain what each coefficient
measures.

ii. Plot the estimated regression function relating Inf to pH for Lead = 0
and for Lead = 1. Describe the differences in the regression functions,
and relate these differences to the coefficients you discussed in (i).

iii. Does Lead have a statistically significant effect on infant mortality?
Explain.

iv. Does the effect of Lead on infant mortality depend on pH? Is this
dependence statistically significant?

v. What is the average value of pH in the sample? At this pH level,
what is the estimated effect of Lead on infant mortality? What is
the standard deviation of pH? Suppose the pH level is one standard
deviation lower than the average level of pH in the sample: What is
the estimated effect of Lead on infant mortality? What if pH is one
standard deviation higher than the average value?

vi. Construct a 95% confidence interval for the effect of Lead on infant
mortality when pH = 6.5.


c. The analysis in (b) may suffer from omitted variable bias because it
neglects factors that affect infant mortality and that might potentially be
correlated with Lead and pH. Investigate this concern, using the other
variables in the data set.


*These data were provided by Professor Karen Clay of Carnegie Mellon University and were used in
her paper with Werner Troesken and Michael Haines, “Lead and Mortality,” Review of Economics and
Statistics, 2014, 96(3).


E8.2 On the text website http://www.pearsonglobaleditions.com, you will find
a data file CPS2015, which contains data for full-time, full-year workers,
ages 25–34, with a high school diploma or B.A./B.S. as their highest degree. A
detailed description is given in CPS2015_Description, also available on the
website. (These are the same data as in CPS96_15, used in Empirical Exercise 3.1,
but are limited to the year 2015.) In this exercise, you will investigate the
relationship between a worker’s age and earnings. (Generally, older workers have
more job experience, leading to higher productivity and higher earnings.)


a. Run a regression of average hourly earnings (AHE) on age (Age), sex 
(Female), and education (Bachelor). If Age increases from 25 to 26, how 
are earnings expected to change? If Age increases from 33 to 34, how are 
earnings expected to change? 


b. Run a regression of the logarithm of average hourly earnings, ln(AHE),
on Age, Female, and Bachelor. If Age increases from 25 to 26, how are
earnings expected to change? If Age increases from 33 to 34, how are
earnings expected to change?

c. Run a regression of the logarithm of average hourly earnings, ln(AHE),
on ln(Age), Female, and Bachelor. If Age increases from 25 to 26, how
are earnings expected to change? If Age increases from 33 to 34, how are
earnings expected to change?

d. Run a regression of the logarithm of average hourly earnings, ln(AHE),
on Age, Age², Female, and Bachelor. If Age increases from 25 to 26, how
are earnings expected to change? If Age increases from 33 to 34, how are
earnings expected to change?

e. Do you prefer the regression in (c) to the regression in (b)? Explain.

f. Do you prefer the regression in (d) to the regression in (b)? Explain.

g. Do you prefer the regression in (d) to the regression in (c)? Explain.

h. Plot the regression relation between Age and ln(AHE) from (b), (c),
and (d) for males with a high school diploma. Describe the similarities
and differences between the estimated regression functions. Would your
answer change if you plotted the regression function for females with
college degrees?


i. Run a regression of ln(AHE) on Age, Age², Female, Bachelor, and the
interaction term Female × Bachelor. What does the coefficient on the
interaction term measure? Alexis is a 30-year-old female with a bachelor’s
degree. What does the regression predict for her value of ln(AHE)?
Jane is a 30-year-old female with a high school diploma. What does the
regression predict for her value of ln(AHE)? What is the predicted
difference between Alexis’s and Jane’s earnings? Bob is a 30-year-old male
with a bachelor’s degree. What does the regression predict for his value
of ln(AHE)? Jim is a 30-year-old male with a high school diploma. What
does the regression predict for his value of ln(AHE)? What is the predicted
difference between Bob’s and Jim’s earnings?

j. Is the effect of Age on earnings different for men than for women? Specify
and estimate a regression that you can use to answer this question.


k. Is the effect of Age on earnings different for high school graduates than 
for college graduates? Specify and estimate a regression that you can use 
to answer this question. 

l. After running all these regressions (and any others that you want to 
run), summarize the effect of age on earnings for young workers. 


APPENDIX


8.1 Regression Functions That Are Nonlinear
in the Parameters


The nonlinear regression functions considered in Sections 8.2 and 8.3 are nonlinear functions 
of the X’s but are linear functions of the unknown parameters. Because they are linear in the 
unknown parameters, those parameters can be estimated by OLS after defining new regressors 
that are nonlinear transformations of the original X’s. This family of nonlinear regression func- 
tions is both rich and convenient to use. In some applications, however, economic reasoning 
leads to regression functions that are not linear in the parameters. Although such regression 
functions cannot be estimated by OLS, they can be estimated using an extension of OLS called
nonlinear least squares.


Functions That Are Nonlinear in the Parameters 


We begin with two examples of functions that are nonlinear in the parameters. We then
provide a general formulation.


Logistic curve. Suppose you are studying the market penetration of a technology, such as the 
adoption of machine learning software in different industries. The dependent variable is the 
fraction of firms in the industry that have adopted the software, a single independent 
variable X describes an industry characteristic, and you have data on n industries. The depen- 
dent variable is between 0 (no adopters) and 1 (100% adoption). Because a linear regression 
model could produce predicted values less than 0 or greater than 1, it makes sense to use 
instead a function that produces predicted values between 0 and 1. 

The logistic function smoothly increases from a minimum of 0 to a maximum of 1. The
logistic regression model with a single X is

$$Y_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_i)}} + u_i. \tag{8.38}$$


The logistic function with a single X and positive values of $\beta_0$ and $\beta_1$ is graphed in Figure 8.12a.
As can be seen in the graph, the logistic function has an elongated “S” shape. For small values
of X, the value of the function is nearly 0, and the slope is flat; the curve is steeper for moderate
values of X; and for large values of X, the function approaches 1, and the slope is flat again.

| FIGURE 8.12 | Two Functions That Are Nonlinear in Their Parameters

(a) A logistic curve (b) A negative exponential growth curve
Figure 8.12a plots the logistic function of Equation (8.38), which has predicted values that lie between 0 and 1.
Figure 8.12b plots the negative exponential growth function of Equation (8.39), which has a slope that is always
positive and decreases as X increases and an asymptote at $\beta_0$ as X tends to infinity.
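The shape described above can be checked numerically. The following is a minimal sketch, not part of the original text: the parameter values `B0 = -4.0` and `B1 = 0.8` are hypothetical and were chosen only so that the steep part of the curve falls at positive values of X. It evaluates the logistic function of Equation (8.38), without the error term, together with its slope.

```python
import math


def logistic(x, b0, b1):
    """Logistic regression function of Equation (8.38), without the error term."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def logistic_slope(x, b0, b1):
    """Derivative of the logistic function with respect to x: b1 * p * (1 - p)."""
    p = logistic(x, b0, b1)
    return b1 * p * (1.0 - p)


# Hypothetical parameter values, used for illustration only.
B0, B1 = -4.0, 0.8

xs = [float(x) for x in range(0, 11)]
values = [logistic(x, B0, B1) for x in xs]
slopes = [logistic_slope(x, B0, B1) for x in xs]

for x, v, s in zip(xs, values, slopes):
    print(f"X = {x:4.1f}   f(X) = {v:.3f}   slope = {s:.3f}")
```

Every function value lies strictly between 0 and 1, and the slope is largest where the function equals 0.5 (here at X = 5) and small at both ends, which is the elongated “S” shape.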


Negative exponential growth. The functions used in Section 8.2 to model the relation between 
test scores and income have some deficiencies. For example, the polynomial models can 
produce a negative slope for some values of income, which is implausible. The logarithmic 
specification has a positive slope for all values of income; however, as income gets very large, 
the predicted values increase without bound, so for some incomes the predicted value for a 
district will exceed the maximum possible score on the test. 

The negative exponential growth model provides a nonlinear specification that has a
positive slope for all values of income, has a slope that is greatest at low values of income and
decreases as income rises, and has an upper bound (that is, an asymptote as income increases
to infinity). The negative exponential growth regression model is

$$Y_i = \beta_0\left[1 - e^{-\beta_1 (X_i - \beta_2)}\right] + u_i. \tag{8.39}$$

The negative exponential growth function with positive values of $\beta_0$ and $\beta_1$ is graphed in Figure 8.12b.
The slope is steep for low values of X, but as X increases, the function approaches an asymptote of $\beta_0$.


General functions that are nonlinear in the parameters. The logistic and negative exponential
growth regression models are special cases of the general nonlinear regression model

$$Y_i = f(X_{1i}, \ldots, X_{ki}; \beta_0, \ldots, \beta_m) + u_i, \tag{8.40}$$

in which there are k independent variables and m + 1 parameters, $\beta_0, \ldots, \beta_m$. In the models
of Sections 8.2 and 8.3, the X’s entered this function nonlinearly, but the parameters entered
linearly. In the examples of this appendix, the parameters enter nonlinearly as well. If the
parameters are known, then predicted effects can be computed using the method described in
Section 8.1. In applications, however, the parameters are unknown and must be estimated from
the data. Parameters that enter nonlinearly cannot be estimated by OLS, but they can be
estimated by nonlinear least squares.


Nonlinear Least Squares Estimation 


Nonlinear least squares is a general method for estimating the unknown parameters of a regres- 
sion function when those parameters enter the population regression function nonlinearly. 
Recall the discussion in Section 5.3 of the OLS estimator of the coefficients of the linear
multiple regression model. The OLS estimator minimizes the sum of squared prediction mistakes
in Equation (5.8), $\sum_{i=1}^{n} [Y_i - (b_0 + b_1 X_{1i} + \cdots + b_k X_{ki})]^2$. In principle, the OLS
estimator can be computed by checking many trial values of $b_0, \ldots, b_k$ and settling on the values
that minimize the sum of squared mistakes.

This same approach can be used to estimate the parameters of the general nonlinear
regression model in Equation (8.40). Because the regression function is nonlinear in the
coefficients, this method is called nonlinear least squares. For a set of trial parameter values
$b_0, b_1, \ldots, b_m$, construct the sum of squared prediction mistakes:

$$\sum_{i=1}^{n} [Y_i - f(X_{1i}, \ldots, X_{ki}; b_0, \ldots, b_m)]^2. \tag{8.41}$$

The nonlinear least squares estimators of $\beta_0, \beta_1, \ldots, \beta_m$ are the values of $b_0, b_1, \ldots, b_m$ that
minimize the sum of squared prediction mistakes in Equation (8.41).

In linear regression, a relatively simple formula expresses the OLS estimator as a function 
of the data. Unfortunately, no such general formula exists for nonlinear least squares, so the 
nonlinear least squares estimator must be found numerically using a computer. Regression 
software incorporates algorithms for solving the nonlinear least squares minimization prob- 
lem, which simplifies the task of computing the nonlinear least squares estimator in practice. 
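To make the idea of checking trial parameter values concrete, here is a minimal sketch for the negative exponential growth model of Equation (8.39). It is an illustration under stated assumptions, not the algorithm used by any particular regression package: it searches a coarse grid of trial values for $b_1$ and $b_2$ and, because the model is linear in $b_0$ once $b_1$ and $b_2$ are fixed, computes the best $b_0$ for each pair in closed form. The data, the "true" parameter values (700, 0.05, -30), and the function names are all hypothetical.

```python
import math


def neg_exp(x, b0, b1, b2):
    """Negative exponential growth function of Equation (8.39), without the error term."""
    return b0 * (1.0 - math.exp(-b1 * (x - b2)))


def ssr(xs, ys, b0, b1, b2):
    """Sum of squared prediction mistakes, as in Equation (8.41)."""
    return sum((y - neg_exp(x, b0, b1, b2)) ** 2 for x, y in zip(xs, ys))


def nls_grid_search(xs, ys, b1_grid, b2_grid):
    """Try every (b1, b2) pair on the grid; return (ssr, b0, b1, b2) with the smallest SSR.

    For fixed (b1, b2) the model is linear in b0, so the SSR-minimizing b0 is
    sum(g * y) / sum(g * g), where g = 1 - exp(-b1 * (x - b2)).
    """
    best = None
    for b1 in b1_grid:
        for b2 in b2_grid:
            g = [1.0 - math.exp(-b1 * (x - b2)) for x in xs]
            denom = sum(gi * gi for gi in g)
            if denom == 0.0:
                continue  # b0 is not identified for this trial pair
            b0 = sum(gi * y for gi, y in zip(g, ys)) / denom
            value = ssr(xs, ys, b0, b1, b2)
            if best is None or value < best[0]:
                best = (value, b0, b1, b2)
    return best


# Hypothetical data: 21 values of X with a small deterministic disturbance added.
xs = [5.0 + 2.5 * i for i in range(21)]
noise = [0.5 if i % 2 == 0 else -0.5 for i in range(21)]
ys = [neg_exp(x, 700.0, 0.05, -30.0) + e for x, e in zip(xs, noise)]

b1_grid = [0.01 + 0.005 * i for i in range(19)]   # trial values 0.01, ..., 0.10
b2_grid = [-50.0 + 5.0 * i for i in range(11)]    # trial values -50, ..., 0

best_ssr, b0_hat, b1_hat, b2_hat = nls_grid_search(xs, ys, b1_grid, b2_grid)
print(f"b0 = {b0_hat:.2f}, b1 = {b1_hat:.4f}, b2 = {b2_hat:.1f}, SSR = {best_ssr:.3f}")
```

A grid search is transparent but crude: its answer can be no finer than the grid, and the cost grows quickly with the number of parameters, which is one reason to rely on the numerical routines built into regression software in practice.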

Under general conditions on the function f and the X’s, the nonlinear least squares estimator
shares two key properties with the OLS estimator in the linear regression model: It is
consistent, and it is normally distributed in large samples. In regression software that supports
nonlinear least squares estimation, the output typically reports standard errors for the
estimated parameters. As a consequence, inference concerning the parameters can proceed as
usual; in particular, t-statistics can be constructed using the general approach in Key Concept 5.1,
and a 95% confidence interval can be constructed as the estimated coefficient, plus or minus
1.96 standard errors. Just as in linear regression, the error term in the nonlinear regression
model can be heteroskedastic, so heteroskedasticity-robust standard errors should be used.


Application to the Test Score-District Income Relation 


A negative exponential growth model, fit to district income (X) and test scores (Y), has the
desirable features of a slope that is always positive [if $\beta_1$ in Equation (8.39) is positive] and an
asymptote of $\beta_0$ as income increases to infinity. Estimating $\beta_0$, $\beta_1$, and $\beta_2$ in Equation (8.39) using the
California test score data yields $\hat\beta_0 = 703.2$ (heteroskedasticity-robust standard error = 4.44),
$\hat\beta_1 = 0.0552$ (SE = 0.0068), and $\hat\beta_2 = -34.0$ (SE = 4.48). Thus the estimated nonlinear
regression function (with standard errors reported below the parameter estimates) is

$$\widehat{TestScore} = \underset{(4.44)}{703.2}\left[1 - e^{-\underset{(0.0068)}{0.0552}\,(Income + \underset{(4.48)}{34.0})}\right]. \tag{8.42}$$

This estimated regression function is plotted in Figure 8.13, along with the logarithmic
regression function and a scatterplot of the data. The two specifications are, in this case, quite similar.
One difference is that the negative exponential growth curve flattens out at the highest levels
of income, consistent with having an asymptote.

| FIGURE 8.13 | The Negative Exponential Growth and Linear-Log Regression Functions

The negative exponential growth regression function [Equation (8.42)] and the
linear-log regression function [Equation (8.18)] both capture the nonlinear relation
between test scores and district income. One difference between the
two functions is that the negative exponential growth model has an asymptote
as Income increases to infinity, but the linear-log regression function does not.
(Vertical axis: Test score. Curves shown: Linear-log regression; Negative exponential growth regression.)
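As a rough check on these properties, the following sketch evaluates the estimated function in Equation (8.42) and its slope at a few income levels and forms the 95% confidence interval for the asymptote as the estimate plus or minus 1.96 standard errors. The parameter values come from Equation (8.42); the income levels and the function names are our own choices for illustration, with income assumed to be in the same units as in the California data set.

```python
import math

# Estimates and the standard error of the asymptote, taken from Equation (8.42).
B0, B1, B2 = 703.2, 0.0552, -34.0
SE_B0 = 4.44


def predicted_test_score(income):
    """Estimated negative exponential growth function of Equation (8.42)."""
    return B0 * (1.0 - math.exp(-B1 * (income - B2)))


def slope(income):
    """Derivative of the estimated function with respect to income."""
    return B0 * B1 * math.exp(-B1 * (income - B2))


# Illustrative income levels (not taken from the text).
for income in (5, 10, 20, 40, 55):
    print(f"Income = {income:3d}   predicted score = {predicted_test_score(income):6.1f}"
          f"   slope = {slope(income):.3f}")

# 95% confidence interval for the asymptote: estimate +/- 1.96 standard errors.
ci_b0 = (B0 - 1.96 * SE_B0, B0 + 1.96 * SE_B0)
print(f"95% CI for the asymptote: ({ci_b0[0]:.2f}, {ci_b0[1]:.2f})")
```

The predicted values rise with income but stay below the estimated asymptote of 703.2, and the slope is positive and shrinks as income grows.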


APPENDIX 


8.2 Slopes and Elasticities for Nonlinear 
Regression Functions 


This appendix uses calculus to evaluate slopes and elasticities of nonlinear regression functions 
with continuous regressors. We focus on the case of Section 8.2, in which there is a single X. 
This approach extends to multiple X’s, using partial derivatives. 

Consider the nonlinear regression model, $Y_i = f(X_i) + u_i$ with $E(u_i | X_i) = 0$. The
slope of the population regression function, $f(X)$, evaluated at the point $X = x$, is the
derivative of f; that is, $df(X)/dX|_{X=x}$. For the polynomial regression function in Equation
(8.9), $f(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \beta_r X^r$ and $dX^a/dX = aX^{a-1}$ for any constant a, so
$df(X)/dX|_{X=x} = \beta_1 + 2\beta_2 x + \cdots + r\beta_r x^{r-1}$. The estimated slope at x is
$d\hat{f}(X)/dX|_{X=x} = \hat\beta_1 + 2\hat\beta_2 x + \cdots + r\hat\beta_r x^{r-1}$. The standard error of the estimated slope is
$SE(\hat\beta_1 + 2\hat\beta_2 x + \cdots + r\hat\beta_r x^{r-1})$; for a given value of x, this is the standard error of a weighted sum of regression
coefficients, which can be computed using the methods of Section 7.3 and Equation (8.8).
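One way to see what a weighted sum of coefficients means here is to compute it directly. The sketch below assumes that the estimated coefficients and their estimated covariance matrix are available (regression software can usually report both) and uses the fact that the variance of a weighted sum $w'\hat\beta$ is $w'Vw$, where V is that covariance matrix. The numbers are hypothetical and the function names are ours; this is a different route to the same standard error than the methods of Section 7.3.

```python
import math


def slope_weights(x, r):
    """Weights on (b0, b1, ..., br) in the slope b1 + 2*b2*x + ... + r*br*x**(r-1)."""
    return [0.0] + [j * x ** (j - 1) for j in range(1, r + 1)]


def estimated_slope_and_se(coefs, cov, x):
    """Estimated slope of a degree-r polynomial at x, and its standard error sqrt(w'Vw)."""
    r = len(coefs) - 1
    w = slope_weights(x, r)
    slope = sum(wj * bj for wj, bj in zip(w, coefs))
    variance = sum(w[i] * cov[i][j] * w[j]
                   for i in range(r + 1) for j in range(r + 1))
    return slope, math.sqrt(variance)


# Hypothetical quadratic fit: coefficient estimates and their covariance matrix.
coefs = [600.0, 4.0, -0.04]
cov = [
    [9.0, 0.0, 0.0],
    [0.0, 0.09, -0.0012],
    [0.0, -0.0012, 0.000025],
]

slope_at_10, se_at_10 = estimated_slope_and_se(coefs, cov, x=10.0)
print(f"estimated slope at x = 10: {slope_at_10:.3f} (SE = {se_at_10:.3f})")
```

Because the weights depend on x, both the slope and its standard error change with the point at which the slope is evaluated.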
The elasticity of Y with respect to X is the percentage change in Y for a given percentage
change in X. Formally, this definition applies in the limit that the percentage change in X goes
to 0, so the slope appearing in the definition in Equation (8.22) is replaced by the derivative
and the elasticity is

$$\text{elasticity of } Y \text{ with respect to } X = \frac{dY}{dX} \times \frac{X}{Y} = \frac{d\ln Y}{d\ln X}.$$

In a regression model, Y depends both on X and on the error term u. It is conventional to
evaluate the elasticity as the percentage change not of Y but of the predicted component of
Y, that is, the percentage change in $E(Y|X)$. Accordingly, the elasticity of $E(Y|X)$ with
respect to X is

$$\frac{dE(Y|X)}{dX} \times \frac{X}{E(Y|X)} = \frac{d\ln E(Y|X)}{d\ln X}.$$


The elasticities for the linear model and for the three logarithmic models summarized in
Key Concept 8.2 are given in the table below.

Case          Population Regression Model                 Elasticity of E(Y|X) with Respect to X
linear        $Y = \beta_0 + \beta_1 X + u$               $\beta_1 X / (\beta_0 + \beta_1 X)$
linear-log    $Y = \beta_0 + \beta_1 \ln(X) + u$          $\beta_1 / [\beta_0 + \beta_1 \ln(X)]$
log-linear    $\ln(Y) = \beta_0 + \beta_1 X + u$          $\beta_1 X$
log-log       $\ln(Y) = \beta_0 + \beta_1 \ln(X) + u$     $\beta_1$


The log-log specification has a constant elasticity, but in the other three specifications, the 
elasticity depends on X. 
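The table entries can be checked numerically. The sketch below compares each formula with a finite-difference approximation to $dE(Y|X)/dX \times X/E(Y|X)$. The parameter values, the evaluation point, and the constant `C` (standing in for $c = E(e^u)$ in the two models with $\ln(Y)$ on the left-hand side, under the assumption that u and X are independent) are hypothetical.

```python
import math

# Hypothetical parameter values; C stands in for c = E(e^u).
B0, B1, C = 2.0, 0.5, 1.3

# Each entry: (conditional mean E(Y|X) as a function of x, elasticity formula from the table).
MODELS = {
    "linear": (lambda x: B0 + B1 * x,
               lambda x: B1 * x / (B0 + B1 * x)),
    "linear-log": (lambda x: B0 + B1 * math.log(x),
                   lambda x: B1 / (B0 + B1 * math.log(x))),
    "log-linear": (lambda x: C * math.exp(B0 + B1 * x),
                   lambda x: B1 * x),
    "log-log": (lambda x: C * math.exp(B0 + B1 * math.log(x)),
                lambda x: B1),
}


def numerical_elasticity(mean_fn, x, h=1e-6):
    """Central-difference approximation to dE(Y|X)/dX * X / E(Y|X)."""
    derivative = (mean_fn(x + h) - mean_fn(x - h)) / (2.0 * h)
    return derivative * x / mean_fn(x)


x0 = 3.0
results = {}
for name, (mean_fn, formula) in MODELS.items():
    results[name] = (numerical_elasticity(mean_fn, x0), formula(x0))
    print(f"{name:10s}  numerical = {results[name][0]:.6f}  formula = {results[name][1]:.6f}")
```

The two columns agree up to numerical error, and repeating the calculation at a different `x0` changes every elasticity except the one for the log-log model.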

We now derive the expressions for the linear-log and log-linear models. For the linear-log
model, $E(Y|X) = \beta_0 + \beta_1 \ln(X)$. Because $d\ln(X)/dX = 1/X$, applying the chain rule
yields $dE(Y|X)/dX = \beta_1/X$. Thus the elasticity is $dE(Y|X)/dX \times X/E(Y|X) =
(\beta_1/X) \times X/[\beta_0 + \beta_1 \ln(X)] = \beta_1/[\beta_0 + \beta_1 \ln(X)]$, as is given in the table. For the
log-linear model, it is conventional to make the additional assumption that u and X are
independently distributed, so the expression for $E(Y|X)$ given following Equation (8.25) becomes
$E(Y|X) = ce^{\beta_0 + \beta_1 X}$, where $c = E(e^u)$ is a constant that does not depend on X because of
the additional assumption that u and X are independent. Thus $dE(Y|X)/dX = ce^{\beta_0 + \beta_1 X}\beta_1$,
and the elasticity is $dE(Y|X)/dX \times X/E(Y|X) = ce^{\beta_0 + \beta_1 X}\beta_1 \times X/(ce^{\beta_0 + \beta_1 X}) = \beta_1 X$. The
derivations for the linear and log-log models are left as Exercise 8.11.


CHAPTER 9

Assessing Studies Based
on Multiple Regression


The preceding five chapters explain how to use multiple regression to analyze the
relationship among variables in a data set. In this chapter, we step back and ask,
What makes a study that uses multiple regression reliable or unreliable? We focus on 
statistical studies that have the objective of estimating the causal effect of a change in 
some independent variable, such as class size, on a dependent variable, such as test 
scores. For such studies, when will multiple regression provide a useful estimate of the 
causal effect, and, just as importantly, when will it fail to do so? 

To answer these questions, this chapter presents a framework for assessing 
statistical studies in general, whether or not they use regression analysis. This 
framework relies on the concepts of internal and external validity. A study is 
internally valid if its statistical inferences about causal effects are valid for the 
population and setting studied; it is externally valid if its inferences can be 
generalized to other populations and settings. In Sections 9.1 and 9.2, we discuss 
internal and external validity, list a variety of possible threats to internal and external 
validity, and discuss how to identify those threats in practice. The discussion in 
Sections 9.1 and 9.2 focuses on the estimation of causal effects from observational 
data. Section 9.3 returns to the prediction problem and discusses threats to the 
validity of predictions made using regression models. 

As an illustration of the framework of internal and external validity, in Section 9.4 
we assess the internal and external validity of the study of the effect on test scores of 
cutting the student-teacher ratio presented in Chapters 4 through 8. 


Internal and External Validity 


The concepts of internal and external validity, defined in Key Concept 9.1, provide a 
framework for evaluating whether a statistical or econometric study is useful for 
answering a specific question of interest. 

Internal and external validity distinguish between the population and setting 
studied and the population and setting to which the results are generalized. The 
population studied is the population of entities (people, companies, school districts,
and so forth) from which the sample was drawn. The population to which the results
are generalized, or the population of interest, is the population of entities to which 
the causal inferences from the study are to be applied. For example, a high school 
(grades 9 through 12) principal might want to generalize our findings on class sizes 
and test scores in California elementary school districts (the population studied) to 
the population of high schools (the population of interest).

KEY CONCEPT 9.1
Internal and External Validity

A statistical analysis is said to have internal validity if the statistical inferences
about causal effects are valid for the population being studied. The analysis is said
to have external validity if its inferences and conclusions can be generalized from
the population and setting studied to other populations and settings.


By setting, we mean the institutional, legal, social, physical, and economic 
environment. For example, it would be important to know whether the findings of a 
laboratory experiment assessing methods for growing organic tomatoes could be 
generalized to the field—that is, whether the organic methods that work in the setting 
of a laboratory also work in the setting of the real world. We provide other examples 
of differences in populations and settings later in this section. 


Threats to Internal Validity 


Internal validity has two components. First, the estimator of the causal effect should
be unbiased and consistent. For example, if $\hat\beta_{STR}$ is the OLS estimator of the effect
on test scores of a unit change in the student-teacher ratio in a certain regression,
then $\hat\beta_{STR}$ should be an unbiased and consistent estimator of the population causal
effect of a change in the student-teacher ratio, $\beta_{STR}$.

Second, hypothesis tests should have the desired significance level (the actual
rejection rate of the test under the null hypothesis should equal its desired significance
level), and confidence intervals should have the desired confidence level. For
example, if a confidence interval is constructed as $\hat\beta_{STR} \pm 1.96\,SE(\hat\beta_{STR})$, this
confidence interval should contain the true population causal effect, $\beta_{STR}$, with 95%
probability over repeated samples drawn from the population being studied.
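The phrase "over repeated samples" can be illustrated by simulation. The following sketch draws many samples from a hypothetical population in which the least squares assumptions hold, computes the OLS slope and a heteroskedasticity-robust standard error for a single regressor in each sample, and records how often the interval of the estimate plus or minus 1.96 standard errors contains the true slope. The population (a true slope of -2.0, a regressor centered at 20, heteroskedastic errors), the sample size, and the function names are all assumptions made for this illustration.

```python
import math
import random


def ols_slope_and_robust_se(xs, ys):
    """OLS slope in a regression of y on a constant and x, with a
    heteroskedasticity-robust standard error (degrees-of-freedom factor n/(n-2))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [y - b0 - b1 * x for x, y in zip(xs, ys)]
    meat = sum(((x - x_bar) ** 2) * (u ** 2) for x, u in zip(xs, resid))
    var_b1 = (n / (n - 2)) * meat / sxx ** 2
    return b1, math.sqrt(var_b1)


def coverage_rate(true_slope, n, reps, seed):
    """Share of simulated samples whose 95% interval contains the true slope."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(20.0, 2.0) for _ in range(n)]
        # Heteroskedastic errors: the error spread grows with |x - 20|.
        ys = [700.0 + true_slope * x + rng.gauss(0.0, 5.0 + 2.0 * abs(x - 20.0))
              for x in xs]
        b1, se = ols_slope_and_robust_se(xs, ys)
        if b1 - 1.96 * se <= true_slope <= b1 + 1.96 * se:
            hits += 1
    return hits / reps


rate = coverage_rate(true_slope=-2.0, n=200, reps=1000, seed=0)
print(f"share of intervals containing the true slope: {rate:.3f}")
```

In a population like this one, the share should be close to 0.95. A threat to internal validity is anything that pushes the actual coverage away from the desired level, whether through a biased estimator or through standard errors that are computed incorrectly.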

In regression analysis, causal effects are estimated using the estimated regression 
function, and hypothesis tests are performed using the estimated regression coeffi- 
cients and their standard errors. Accordingly, in a study based on OLS regression, the 
requirements for internal validity are that the OLS estimator is unbiased and consis- 
tent and that standard errors are computed in a way that makes confidence intervals 
have the desired confidence level. For various reasons, these requirements might not 
be met, and these reasons constitute threats to internal validity. These threats lead to 
failures of one or more of the least squares assumptions in Key Concept 6.4. For 
example, one threat that we have discussed at length is omitted variable bias; it leads 
to correlation between one or more regressors and the error term, which violates the 
first least squares assumption. If data are available on the omitted variable or on an 
adequate control variable, then this threat can be avoided by including that variable 
as an additional regressor.

Section 9.2 provides a detailed discussion of the various threats to internal validity
in multiple regression analysis and suggests how to mitigate them.


Threats to External Validity 


Potential threats to external validity arise from differences between the population 
and setting studied and the population and setting of interest. 


Differences in populations. Differences between the population studied and the 
population of interest can pose a threat to external validity. For example, laboratory 
studies of the toxic effects of chemicals typically use animal populations like mice 
(the population studied), but the results are used to write health and safety 
regulations for human populations (the population of interest). Whether mice and 
men differ sufficiently to threaten the external validity of such studies is a matter of 
debate. 

More generally, the true causal effect might not be the same in the population 
studied and the population of interest. This could be because the population was 
chosen in a way that makes it different from the population of interest, because of 
differences in characteristics of the populations, because of geographical differences, 
or because the study is out of date. 


Differences in settings. Even if the population being studied and the population of 
interest are identical, it might not be possible to generalize the study results if the 
settings differ. For example, a study of the effect on college binge drinking of an 
antidrinking advertising campaign might not generalize to another, identical group 
of college students if the legal penalties for drinking at the two colleges differ. In this 
case, the legal setting in which the study was conducted differs from the legal setting 
to which its results are applied. 

More generally, examples of differences in settings include differences in the 
institutional environment (public universities versus religious universities), differ- 
ences in laws (differences in legal penalties), and differences in the physical environ- 
ment (tailgate-party binge drinking in southern California versus Fairbanks, Alaska). 


Application to test scores and the student-teacher ratio. Chapters 7 and 8 reported 
statistically significant, but substantively small, estimated improvements in test scores 
resulting from reducing the student-teacher ratio. This analysis was based on test 
results for California school districts. Suppose for the moment that these results are 
internally valid. To what other populations and settings of interest could this finding 
be generalized? 

The closer the population and setting of the study are to those of interest, the 
stronger the case is for external validity. For example, college students and college 
instruction are very different from elementary school students and instruction, so it is 
implausible that the effect of reducing class sizes estimated using the California
elementary school district data would generalize to colleges. On the other hand,
elementary school students, curriculum, and organization are broadly similar throughout
the United States, so it is plausible that the California results might generalize to
performance on standardized tests in other U.S. elementary school districts.


How to assess the external validity of a study. External validity must be judged 
using specific knowledge of the populations and settings studied and those of inter- 
est. Important differences between the two will cast doubt on the external validity of 
the study. 

Sometimes there are two or more studies on different but related populations. If 
so, the external validity of both studies can be checked by comparing their results. 
For example, in Section 9.4, we analyze test score and class size data for elementary 
school districts in Massachusetts and compare the Massachusetts and California 
results. In general, similar findings in two or more studies bolster claims to external 
validity, while differences in their findings that are not readily explained cast doubt 
on their external validity. 


How to design an externally valid study. Because threats to external validity stem 
from a lack of comparability of populations and settings, these threats are best 
minimized at the early stages of a study, before the data are collected. Study design 
is beyond the scope of this textbook, and the interested reader is referred to Shadish, 
Cook, and Campbell (2002). 


Threats to Internal Validity 
of Multiple Regression Analysis 


Studies based on regression analysis are internally valid if the estimated regression
coefficients are unbiased and consistent for the causal effect of interest and if their
standard errors yield confidence intervals with the desired confidence level. This
section surveys five reasons why the OLS estimator of the multiple regression
coefficients might be biased, even in large samples: omitted variables, misspecification of
the functional form of the regression function, imprecise measurement of the
independent variables (“errors in variables”), sample selection, and simultaneous
causality. All five sources of bias arise because the regressor is correlated with the error
term in the population regression, violating the first least squares assumption in
Key Concept 6.4. For each, we discuss what can be done to reduce this bias. The
section concludes with a discussion of circumstances that lead to inconsistent standard
errors and what can be done about it.


¹A comparison of many related studies on the same topic is called a meta-analysis. The discussion in the
box “The Mozart Effect: Omitted Variable Bias?” in Chapter 6 is based on a meta-analysis, for example.
Performing a meta-analysis of many studies has its own challenges. How do you sort the good studies from
the bad? How do you compare studies when the dependent variables differ? Should you put more weight
on studies with larger samples? A discussion of meta-analysis and its challenges goes beyond the scope of
this text. The interested reader is referred to Hedges and Olkin (1985), Cooper and Hedges (1994), and, for
more recent work that interprets p-values from published studies, Simonsohn, Nelson, and Simmons (2014).


Omitted Variable Bias 


Recall that omitted variable bias arises when a variable that both determines Y and 
is correlated with one or more of the included regressors is omitted from the regres- 
sion. This bias persists even in large samples, so the OLS estimator is inconsistent. 
How best to minimize omitted variable bias depends on whether or not variables that 
adequately control for the potential omitted variable are available. 


Solutions to omitted variable bias when the variable is observed or there are 
adequate control variables. If you have data on the omitted variable, then you can 
include that variable in a multiple regression, thereby addressing the problem. 
Alternatively, if you have data on one or more control variables and if these control 
variables are adequate in the sense that they lead to conditional mean independence 
[Equation (6.18)], then including those control variables eliminates the potential bias 
in the coefficient on the variable of interest. 
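A small simulation can make this concrete. In the sketch below, the outcome depends on a regressor of interest `x` and on a second variable `w` that is correlated with `x`. Regressing the outcome on `x` alone gives a slope that is far from the true effect, while including `w` recovers it. The data-generating process (a true effect of 2.0 for `x` and 3.0 for `w`), the sample size, and the function names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simple_slope(xs, ys):
    """OLS slope from regressing y on a constant and x."""
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


def two_regressor_slopes(x1, x2, ys):
    """OLS slopes from regressing y on a constant, x1, and x2,
    obtained by solving the two normal equations on demeaned data."""
    m1, m2, my = mean(x1), mean(x2), mean(ys)
    d1 = [a - m1 for a in x1]
    d2 = [a - m2 for a in x2]
    dy = [a - my for a in ys]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Hypothetical population: y = 1 + 2*x + 3*w + u, with w correlated with x.
rng = random.Random(1)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
w = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
y = [1.0 + 2.0 * xi + 3.0 * wi + rng.gauss(0.0, 1.0) for xi, wi in zip(x, w)]

short_slope = simple_slope(x, y)                # w omitted
long_b1, long_b2 = two_regressor_slopes(x, w, y)  # w included

print(f"slope on x with w omitted:  {short_slope:.3f}")
print(f"slope on x with w included: {long_b1:.3f}")
```

In any sample, the slope from the short regression equals the slope on `x` from the long regression plus the coefficient on `w` multiplied by the slope from regressing `w` on `x`. This shows why the bias disappears only if `w` has no effect on the outcome or is uncorrelated with `x`.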

Adding a variable to a regression has both costs and benefits. On the one hand, 
omitting the variable could result in omitted variable bias. On the other hand, includ- 
ing the variable when it does not belong (that is, when its population regression coef- 
ficient is 0) reduces the precision of the estimators of the other regression coefficients. 
In other words, the decision whether to include a variable involves a trade-off 
between bias and variance of the coefficient of interest. In practice, there are four 
steps that can help you decide whether to include a variable or set of variables in a 
regression. 

The first step is to identify the key coefficient or coefficients of interest in your 
regression. In the test score regressions, this is the coefficient on the student-teacher 
ratio because the question originally posed concerns the effect on test scores of 
reducing the student-teacher ratio. 

The second step is to ask yourself: What are the most likely sources of important 
omitted variable bias in this regression? Answering this question requires applying 
economic theory and expert knowledge, and should occur before you actually run 
any regressions; because this step is done before analyzing the data, it is referred to 
as a priori (“before the fact”) reasoning. In the test score example, this step entails 
identifying those determinants of test scores that, if ignored, could bias our estima- 
tor of the class size effect. The results of this step are a base regression specification, 
the starting point for your empirical regression analysis, and a list of additional, 
“questionable” control variables that might help to mitigate possible omitted vari- 
able bias. 

The third step is to augment your base specification with the additional, questionable
control variables identified in the second step. If the coefficients on the additional
control variables are statistically significant and/or if the estimated coefficients of
interest change appreciably when the additional variables are included, then they should
remain in the specification and you should modify your base specification. If not, then
these variables can be excluded from the regression.


Key Concept 9.2
Omitted Variable Bias: Should I Include More Variables in My Regression?

If you include another variable in your multiple regression, you will eliminate the
possibility of omitted variable bias from excluding that variable, but the variance
of the estimator of the coefficients of interest can increase. Here are some guidelines
to help you decide whether to include an additional variable:

1. Be specific about the coefficient or coefficients of interest.

2. Use a priori reasoning to identify the most important potential sources of omitted
   variable bias, leading to a base specification and some “questionable” variables.

3. Test whether additional, “questionable” control variables have nonzero coefficients,
   and assess whether including a questionable control variable makes a meaningful
   change in the coefficient of interest.

4. Provide “full disclosure” representative tabulations of your results so that
   others can see the effect of including the questionable variables on the
   coefficient(s) of interest.


The fourth step is to present an accurate summary of your results in tabular form. 
This provides “full disclosure” to a potential skeptic, who can then draw his or her 
own conclusions. Tables 7.1 and 8.3 are examples of this strategy. For example, in 
Table 8.3, we could have presented only the regression in column (7) because that 
regression summarizes the relevant effects and nonlinearities in the other regressions 
in that table. Presenting the other regressions, however, permits the skeptical reader 
to draw his or her own conclusions. 

These steps are summarized in Key Concept 9.2. 
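
The third and fourth steps can be sketched in code. What follows is a minimal illustration on simulated data, not the test score data: the variable names, the coefficient values, and the helper functions `ols_robust` and `compare_specifications` are all assumptions made for this example. It estimates a base specification and an augmented specification and reports the coefficient of interest under each, together with the t-statistic on the questionable control.

```python
import numpy as np

def ols_robust(y, X):
    """OLS coefficients with heteroskedasticity-robust (HC1) standard errors."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ (X.T @ y)
    resid = y - X @ beta
    meat = (X * resid[:, None] ** 2).T @ X       # sum of x_i x_i' u_hat_i^2
    cov = xtx_inv @ meat @ xtx_inv * n / (n - k)
    return beta, np.sqrt(np.diag(cov))

def compare_specifications(n=1000, seed=0):
    """Fit a base and an augmented regression on simulated (hypothetical) data."""
    rng = np.random.default_rng(seed)
    control = rng.normal(size=n)                 # "questionable" control variable
    x = 0.6 * control + rng.normal(size=n)       # regressor of interest
    y = 2.0 - 1.0 * x - 1.5 * control + rng.normal(size=n)
    const = np.ones(n)
    b_base, se_base = ols_robust(y, np.column_stack([const, x]))
    b_aug, se_aug = ols_robust(y, np.column_stack([const, x, control]))
    return {
        "base_coef": b_base[1], "base_se": se_base[1],
        "aug_coef": b_aug[1], "aug_se": se_aug[1],
        "control_t": b_aug[2] / se_aug[2],
    }

if __name__ == "__main__":
    # A small "full disclosure" tabulation of both specifications
    for name, value in compare_specifications().items():
        print(f"{name:>10}: {value: .3f}")
```

In this simulated design the control is both significant and changes the coefficient of interest appreciably, so by the third step it would stay in the specification.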


Solutions to omitted variable bias when adequate control variables are not 
available. Adding an omitted variable to a regression is not an option if you do not 
have data on that variable and if there are no adequate control variables. Still, there 
are three other ways to solve omitted variable bias. Each of these three solutions 
circumvents omitted variable bias through the use of different types of data. 

The first solution is to use data in which the same observational unit is observed 
at different points in time. For example, test score and related data might be collected 
for the same districts in 1995 and again in 2000. Data in this form are called panel data. 
As explained in Chapter 10, panel data make it possible to control for unobserved 
omitted variables as long as those omitted variables do not change over time.


Key Concept 9.3
Functional Form Misspecification


Functional form misspecification arises when the functional form of the esti- 
mated regression function differs from the functional form of the population 
regression function. If the functional form is misspecified, then the estimator of 
the partial effect of a change in one of the variables will, in general, be biased. 
Functional form misspecification often can be detected by plotting the data and 
the estimated regression function, and it can be corrected by using a different 
functional form. 


The second solution is to use instrumental variables regression. This method 
relies on a new variable, called an instrumental variable. Instrumental variables 
regression is discussed in Chapter 12. 

The third solution is to use a study design in which the effect of interest (for 
example, the effect of reducing class size on student achievement) is studied using a 
randomized controlled experiment. Randomized controlled experiments are 
discussed in Chapter 13. 


Misspecification of the Functional Form 
of the Regression Function 


If the true population regression function is nonlinear but the estimated regression is 
linear, then this functional form misspecification makes the OLS estimator biased. This 
bias is a type of omitted variable bias, in which the omitted variables are the terms that 
reflect the missing nonlinear aspects of the regression function. For example, if the 
population regression function is a quadratic polynomial, then a regression that omits 
the square of the independent variable would suffer from omitted variable bias. Bias 
arising from functional form misspecification is summarized in Key Concept 9.3. 
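
The following sketch illustrates this bias on simulated data. The quadratic population regression function, its coefficients, and the function name `effect_of_unit_change` are assumptions chosen for illustration. It compares the estimated effect on Y of changing X from 1 to 2 under a linear specification and under a quadratic specification.

```python
import numpy as np

def effect_of_unit_change(n=5000, seed=0):
    """Estimated effect on Y of moving X from 1 to 2 under a linear fit and a
    quadratic fit, when the (hypothetical) truth is quadratic."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, size=n)
    y = 1.0 + 5.0 * x - 0.3 * x**2 + rng.normal(size=n)

    X_lin = np.column_stack([np.ones(n), x])          # omits the squared term
    X_quad = np.column_stack([np.ones(n), x, x**2])   # correct functional form
    b_lin = np.linalg.lstsq(X_lin, y, rcond=None)[0]
    b_quad = np.linalg.lstsq(X_quad, y, rcond=None)[0]

    true_effect = 5.0 * (2 - 1) - 0.3 * (2**2 - 1**2)  # 4.1 up to rounding
    lin_effect = b_lin[1] * (2 - 1)
    quad_effect = b_quad[1] * (2 - 1) + b_quad[2] * (2**2 - 1**2)
    return true_effect, lin_effect, quad_effect

if __name__ == "__main__":
    true_effect, lin_effect, quad_effect = effect_of_unit_change()
    print(f"true: {true_effect:.2f}  linear: {lin_effect:.2f}  quadratic: {quad_effect:.2f}")
```

The linear fit averages the slope over the whole range of X, so its estimate of the effect at low values of X is far from the true partial effect, while the quadratic fit recovers it.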


Solutions to functional form misspecification. When the dependent variable is continuous
(like test scores), this problem of potential nonlinearity can be solved using the methods
of Chapter 8. If, however, the dependent variable is discrete or binary (for example,
if $Y_i$ equals 1 if the $i$th person attended college and equals 0 otherwise), things are
more complicated. Regression with a discrete dependent variable is discussed in Chapter 11.


Measurement Error and Errors-in-Variables Bias 


Suppose that in our regression of test scores against the student-teacher ratio we had 
inadvertently mixed up our data, so that we ended up regressing test scores for fifth 
graders on the student-teacher ratio for tenth graders in that district. Although the 
student-teacher ratio for elementary school students and tenth graders might be
correlated, they are not the same, so this mix-up would lead to bias in the estimated
coefficient. This is an example of errors-in-variables bias because its source is an 
error in the measurement of the independent variable. This bias persists even in very 
large samples, so the OLS estimator is inconsistent if there is measurement error. 

There are many possible sources of measurement error. If the data are collected 
through a survey, a respondent might give the wrong answer. For example, one ques- 
tion in the Current Population Survey involves last year’s earnings. A respondent 
might not know his or her exact earnings or might misstate the amount for some 
other reason. If instead the data are obtained from computerized administrative 
records, there might have been errors when the data were first entered. 

To see that errors in variables can result in correlation between the regressor and
the error term, suppose there is a single regressor $X_i$ (say, actual earnings), which is
measured imprecisely by $\tilde{X}_i$ (the respondent’s stated earnings). Because
$\tilde{X}_i$, not $X_i$, is observed, the regression equation actually estimated is the one
based on $\tilde{X}_i$. Written in terms of the imprecisely measured variable $\tilde{X}_i$,
the population regression equation $Y_i = \beta_0 + \beta_1 X_i + u_i$ is

$$Y_i = \beta_0 + \beta_1 \tilde{X}_i + [\beta_1 (X_i - \tilde{X}_i) + u_i]
      = \beta_0 + \beta_1 \tilde{X}_i + v_i, \tag{9.1}$$

where $v_i = \beta_1 (X_i - \tilde{X}_i) + u_i$. Thus the population regression equation
written in terms of $\tilde{X}_i$ has an error term that contains the measurement error, the
difference between $\tilde{X}_i$ and $X_i$. If this difference is correlated with the
measured value $\tilde{X}_i$, then the regressor $\tilde{X}_i$ will be correlated with the
error term, and $\hat{\beta}_1$ will be biased and inconsistent.

The precise size and direction of the bias in $\hat{\beta}_1$ depend on the correlation
between $\tilde{X}_i$ and the measurement error, $\tilde{X}_i - X_i$. This correlation depends
in turn on the specific nature of the measurement error.

For example, suppose the measured value, $\tilde{X}_i$, equals the actual, unmeasured
value, $X_i$, plus a purely random component, $w_i$, which has mean 0 and variance
$\sigma_w^2$. Because the error is purely random, we might suppose that $w_i$ is
uncorrelated with $X_i$ and with the regression error $u_i$. This assumption constitutes the
classical measurement error model, in which $\tilde{X}_i = X_i + w_i$, where
$\mathrm{corr}(w_i, X_i) = 0$ and $\mathrm{corr}(w_i, u_i) = 0$. Under the classical
measurement error model, a bit of algebra² shows that $\hat{\beta}_1$ has the probability limit

$$\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1. \tag{9.2}$$
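²Under this measurement error assumption, $v_i = \beta_1 (X_i - \tilde{X}_i) + u_i =
-\beta_1 w_i + u_i$, $\mathrm{cov}(\tilde{X}_i, u_i) = 0$, and $\mathrm{cov}(\tilde{X}_i, w_i) =
\mathrm{cov}(X_i + w_i, w_i) = \sigma_w^2$, so $\mathrm{cov}(\tilde{X}_i, v_i) =
-\beta_1 \mathrm{cov}(\tilde{X}_i, w_i) + \mathrm{cov}(\tilde{X}_i, u_i) = -\beta_1 \sigma_w^2$.
Thus, from Equation (6.1), $\hat{\beta}_1 \xrightarrow{p} \beta_1 - \beta_1 \sigma_w^2 /
\sigma_{\tilde{X}}^2$. Now $\sigma_{\tilde{X}}^2 = \sigma_X^2 + \sigma_w^2$, so
$\hat{\beta}_1 \xrightarrow{p} \beta_1 - \beta_1 \sigma_w^2 / (\sigma_X^2 + \sigma_w^2) =
[\sigma_X^2 / (\sigma_X^2 + \sigma_w^2)]\,\beta_1$.


Key Concept 9.4
Errors-in-Variables Bias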


Errors-in-variables bias in the OLS estimator arises when an independent vari- 
able is measured imprecisely. This bias depends on the nature of the measurement 
error and persists even if the sample size is large. If the measured variable equals 
the actual value plus a mean 0, independently distributed measurement error, 
then the OLS estimator in a regression with a single right-hand variable is biased 
toward 0, and its probability limit is given in Equation (9.2). 


That is, if the measurement error has the effect of simply adding a random element
to the actual value of the independent variable, then $\hat{\beta}_1$ is inconsistent.
Because the ratio $\sigma_X^2 / (\sigma_X^2 + \sigma_w^2)$ is less than 1, $\hat{\beta}_1$
will be biased toward 0, even in large samples. In the extreme case that the measurement
error is so large that essentially no information about $X_i$ remains, the ratio of the
variances in the final expression in Equation (9.2) is 0, and $\hat{\beta}_1$ converges in
probability to 0. In the other extreme, when there is no measurement error,
$\sigma_w^2 = 0$, so $\hat{\beta}_1 \xrightarrow{p} \beta_1$.
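
A short simulation can be used to check the probability limit in Equation (9.2). This is a sketch under assumed values: the normal distributions, the parameter values, and the function name `attenuation_demo` are choices made here for illustration. It regresses Y on the mismeasured regressor in a large sample and compares the slope with the value implied by Equation (9.2).

```python
import numpy as np

def attenuation_demo(beta1=2.0, sigma_x=1.0, sigma_w=1.0, n=200_000, seed=0):
    """Return (OLS slope using the mismeasured regressor, limit from Eq. (9.2))."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma_x, n)      # actual value X_i (unobserved)
    w = rng.normal(0.0, sigma_w, n)      # classical measurement error w_i
    u = rng.normal(0.0, 1.0, n)          # regression error u_i
    x_tilde = x + w                      # measured value
    y = 1.0 + beta1 * x + u
    slope = np.cov(x_tilde, y)[0, 1] / np.var(x_tilde, ddof=1)
    limit = sigma_x**2 / (sigma_x**2 + sigma_w**2) * beta1
    return slope, limit

if __name__ == "__main__":
    slope, limit = attenuation_demo()
    print(f"estimated slope: {slope:.3f}, limit from Equation (9.2): {limit:.3f}")
```

With equal variances of the true value and the measurement error, the slope settles near half of the true coefficient, and setting `sigma_w=0.0` restores the true value.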

A different model of measurement error supposes that the respondent makes his
or her best estimate of the true value. In this “best guess” model, the response
$\tilde{X}_i$ is modeled as the conditional mean of $X_i$ given the information available to
the respondent. Because $\tilde{X}_i$ is the best guess, the measurement error
$\tilde{X}_i - X_i$ is uncorrelated with the response $\tilde{X}_i$ (if the measurement error
were correlated with $\tilde{X}_i$, then that would be useful information for predicting
$X_i$, in which case $\tilde{X}_i$ would not have been the best guess of $X_i$). That is,
$E[(\tilde{X}_i - X_i)\tilde{X}_i] = 0$, and if the respondent’s information is uncorrelated
with $u_i$, then $\tilde{X}_i$ is uncorrelated with the error term $v_i$. Thus, in this
“best guess” measurement error model, $\hat{\beta}_1$ is consistent, but because
$\mathrm{var}(v_i) > \mathrm{var}(u_i)$, the variance of $\hat{\beta}_1$ is larger than it
would be absent measurement error. The “best guess” measurement error model is examined
further in Exercise 9.12.
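
The “best guess” case can also be simulated. In this sketch the actual value is the sum of a component the respondent knows and a component the respondent cannot see, so the conditional mean is simply the known component. That structure, the parameter values, and the name `best_guess_demo` are assumptions made for illustration.

```python
import numpy as np

def best_guess_demo(beta1=2.0, n=200_000, seed=0):
    """Return (OLS slope on the best-guess regressor, var(v), var(u))."""
    rng = np.random.default_rng(seed)
    info = rng.normal(size=n)        # what the respondent knows (hypothetical)
    unseen = rng.normal(size=n)      # part of X the respondent cannot see
    x = info + unseen                # actual value X_i
    x_tilde = info                   # best guess: E[X_i | info] = info
    u = rng.normal(size=n)
    y = 1.0 + beta1 * x + u
    slope = np.cov(x_tilde, y)[0, 1] / np.var(x_tilde, ddof=1)
    v = y - 1.0 - beta1 * x_tilde    # v_i = beta1 * (X_i - X~_i) + u_i
    return slope, np.var(v), np.var(u)

if __name__ == "__main__":
    slope, var_v, var_u = best_guess_demo()
    print(f"slope: {slope:.3f}, var(v): {var_v:.2f}, var(u): {var_u:.2f}")
```

The slope is close to the true coefficient, but the error term has a larger variance than $u_i$, which is why the estimator is less precise.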

Problems created by measurement error can be even more complicated if there
is intentional misreporting. For example, suppose that survey respondents provide
the income reported on their income taxes but intentionally underreport their true
taxable income by not including cash payments. If, for example, all respondents
report only 90% of income, then $\tilde{X}_i = 0.90X_i$, and $\hat{\beta}_1$ will be biased
up by roughly 10% (more precisely, its probability limit is
$\beta_1 / 0.90 \approx 1.11\beta_1$).

Although the result in Equation (9.2) is specific to classical measurement error, 
it illustrates the more general proposition that if the independent variable is mea- 
sured imprecisely, then the OLS estimator may be biased, even in large samples. 
Errors-in-variables bias is summarized in Key Concept 9.4. 


Measurement error in Y. The effect of measurement error in Y is different from that
of measurement error in X. If Y has classical measurement error, then this measurement
error increases the variance of the regression and of $\hat{\beta}_1$ but does not induce
bias in $\hat{\beta}_1$. To see this, suppose that measured $Y_i$ is $\tilde{Y}_i$, which
equals true $Y_i$ plus random measurement error $w_i$. Then the regression model estimated
is $\tilde{Y}_i = \beta_0 + \beta_1 X_i + v_i$, where $v_i = w_i + u_i$. If $w_i$ is truly
random, then $w_i$ and $X_i$ are independently distributed, so that $E(w_i | X_i) = 0$, in
which case $E(v_i | X_i) = 0$, so $\hat{\beta}_1$ is unbiased. However, because
$\mathrm{var}(v_i) > \mathrm{var}(u_i)$, the variance of $\hat{\beta}_1$ is larger than it
would be without measurement error. In the test score/class size example, suppose test
scores have purely random grading errors that are independent of the regressors; then the
classical measurement error model of this paragraph applies to $\tilde{Y}_i$, and
$\hat{\beta}_1$ is unbiased. More generally, measurement error in Y that has conditional
mean 0 given the regressors will not induce bias in the OLS coefficients.
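
A small Monte Carlo experiment illustrates both claims: no bias, but more variance. The design below (normal draws, sample size, number of replications, and the function name `slope_sampling_distribution`) is an assumption chosen for illustration.

```python
import numpy as np

def slope_sampling_distribution(sigma_w, beta1=2.0, n=200, reps=2000, seed=0):
    """Mean and standard deviation of the OLS slope across simulated samples
    when Y is observed with classical measurement error of std. dev. sigma_w."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(reps, n))
    u = rng.normal(size=(reps, n))
    w = rng.normal(scale=sigma_w, size=(reps, n))   # measurement error in Y
    y_tilde = 1.0 + beta1 * x + u + w               # measured Y
    xd = x - x.mean(axis=1, keepdims=True)
    yd = y_tilde - y_tilde.mean(axis=1, keepdims=True)
    slopes = (xd * yd).sum(axis=1) / (xd**2).sum(axis=1)   # one slope per sample
    return slopes.mean(), slopes.std(ddof=1)

if __name__ == "__main__":
    for sigma_w in (0.0, 2.0):
        mean, sd = slope_sampling_distribution(sigma_w)
        print(f"sigma_w = {sigma_w}: mean slope {mean:.3f}, std. dev. {sd:.3f}")
```

In both cases the slopes are centered on the true coefficient; with measurement error in Y their spread is visibly larger.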


Solutions to errors-in-variables bias. The best way to solve the errors-in-variables 
problem is to get an accurate measure of X. If this is impossible, however, economet- 
ric methods can be used to mitigate errors-in-variables bias. 

One such method is instrumental variables regression. It relies on having another 
variable (the instrumental variable) that is correlated with the actual value X; but is 
uncorrelated with the measurement error. This method is studied in Chapter 12. 

A second method is to develop a mathematical model of the measurement error 
and, if possible, to use the resulting formulas to adjust the estimates. For example, if 
a researcher believes that the classical measurement error model applies and if she 
knows or can estimate the ratio $\sigma_w^2 / \sigma_X^2$, then she can use Equation (9.2) to
compute an estimator of $\beta_1$ that corrects for the downward bias. Because this approach
requires specialized knowledge about the nature of the measurement error, the 
details typically are specific to a given data set and its measurement problems, and 
we shall not pursue this approach further in this text. 
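
As a brief worked illustration of the adjustment just described, Equation (9.2) can be inverted to express $\beta_1$ in terms of the probability limit of the OLS estimator and the variance ratio. The numbers in the second line are hypothetical and chosen only to make the arithmetic easy.

```latex
\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1
  = \frac{1}{1 + \sigma_w^2/\sigma_X^2}\,\beta_1
\quad\Longrightarrow\quad
\beta_1 = \left(1 + \frac{\sigma_w^2}{\sigma_X^2}\right)\operatorname{plim}\hat{\beta}_1 .

% Hypothetical numbers: \sigma_w^2/\sigma_X^2 = 0.25 and \hat{\beta}_1 = 1.6
\hat{\beta}_1^{\,\text{corrected}} = (1 + 0.25) \times 1.6 = 2.0
```

The correction is only as good as the assumed ratio, which is why it requires specialized knowledge of the measurement process.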


Missing Data and Sample Selection 


Missing data are a common feature of economic data sets. Whether missing data pose 
a threat to internal validity depends on why the data are missing. We consider three 
cases: when the data are missing completely at random, when the data are missing 
based on X, and when the data are missing because of a selection process that is 
related to Y beyond depending on X. 

When the data are missing completely at random — that is, for random reasons 
unrelated to the values of X or Y—the effect is to reduce the sample size but not 
introduce bias. For example, suppose you conduct a simple random sample of 100 
classmates, then randomly lose half the records. It would be as if you had never sur- 
veyed those individuals. You would be left with a simple random sample of 50 class- 
mates, so randomly losing the records does not introduce bias. 

When the data are missing based on the value of a regressor, the effect also is to
reduce the sample size but not to introduce bias. For example, in the class size/
student-teacher ratio example, suppose we used only the districts in which the
student-teacher ratio exceeds 20. Although we would not be able to draw conclusions
about what happens when $STR \le 20$, this would not introduce bias into our analysis
of the class size effect for districts with $STR > 20$.


Key Concept 9.5
Sample Selection Bias

Sample selection bias arises when a selection process influences the availability
of data and that process is related to the dependent variable beyond depending
on the regressors. Such sample selection induces correlation between one or
more regressors and the error term, leading to bias and inconsistency of the OLS
estimator.


In contrast to the first two cases, if the data are missing because of a selection 
process that is related to the value of the dependent variable (Y) beyond depending 
on the regressors (X), then this selection process can introduce correlation between 
the error term and the regressors. The resulting bias in the OLS estimator is called 
sample selection bias. An example of sample selection bias in polling was given in 
the box “Landon Wins!” in Section 3.1. In that example, the sample selection method 
(randomly selecting phone numbers of automobile owners) was related to the depen- 
dent variable (who the individual supported for president in 1936) because in 1936 
car owners with phones were more likely to be Republicans. The sample selection 
problem can be cast either as a consequence of nonrandom sampling or as a missing 
data problem. In the 1936 polling example, the sample was a random sample of car 
owners with phones, not a random sample of voters. Alternatively, this example can 
be cast as a missing data problem by imagining a random sample of voters but with 
missing data for those without cars and phones. The mechanism by which the data 
are missing is related to the dependent variable, leading to sample selection bias. 

Sample selection bias is summarized in Key Concept 9.5.³
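
The three missing data cases can be compared in a simulation. The sketch below is illustrative only: the population regression, the selection rules, and the function names `slope` and `missing_data_cases` are assumptions. It estimates the same slope on the full sample, on a random half, on observations selected by X, and on observations selected by Y.

```python
import numpy as np

def slope(x, y):
    """OLS slope from a regression of y on x with an intercept."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def missing_data_cases(beta1=2.0, n=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + beta1 * x + rng.normal(size=n)
    keep_random = rng.random(n) < 0.5    # missing completely at random
    keep_on_x = x > 0.0                  # missing based on the regressor
    keep_on_y = y > 1.0                  # selection related to Y beyond X
    return {
        "full": slope(x, y),
        "random": slope(x[keep_random], y[keep_random]),
        "on_x": slope(x[keep_on_x], y[keep_on_x]),
        "on_y": slope(x[keep_on_y], y[keep_on_y]),
    }

if __name__ == "__main__":
    for case, value in missing_data_cases().items():
        print(f"{case:>6}: {value:.3f}")
```

Only the last case moves the slope away from the true value: keeping observations because Y is high also keeps observations with unusually high errors at low values of X, which correlates the regressor with the error term.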


Solutions to selection bias. The best solution to sample selection bias is to avoid it 
by the design of your study. If you want to estimate the mean height of undergradu- 
ates for your statistics course, do so by using a random sample of all undergradu- 
ates—not by sampling students as they enter a basketball court. The box “Do Stock 
Mutual Funds Outperform the Market?” describes a way to select a sample of funds 
to avoid a more subtle form of sample selection bias. If your data do have sample 
selection bias, it cannot be eliminated using the methods we have discussed so far. 
Methods for estimating models with sample selection are beyond the scope of this 
text. Some of those methods build on the techniques introduced in Chapter 11, where 
further references are provided. 


³Exercise 19.16 provides a mathematical treatment of the three missing data cases discussed here.


Do Stock Mutual Funds Outperform the Market?


Stock mutual funds are investment vehicles
that hold a portfolio of stocks. By purchasing 
shares in a mutual fund, a small investor can hold 
a broadly diversified portfolio without the hassle 
and expense (transaction cost) of buying and selling 
shares in individual companies. Some mutual funds 
simply track the market (for example, by holding the 
stocks in the S&P 500), whereas others are actively 
managed by full-time professionals whose job is to 
make the fund earn a better return than the over- 
all market—and competitors’ funds. But do these 
actively managed funds achieve this goal? Do some 
mutual funds consistently beat other funds and the 
market? 

One way to answer these questions is to com- 
pare future returns on mutual funds that had high 
returns over the past year to future returns on other 
funds and on the market as a whole. In making such 
comparisons, financial economists know that it is 
important to select the sample of mutual funds care- 
fully. This task is not as straightforward as it seems, 
however. Some databases include historical data 
on funds currently available for purchase, but this 
approach means that the dogs (the most poorly
performing funds) are omitted from the data set
because they went out of business or were merged
into other funds. For this reason, a study using data
on historical performance of currently available
funds is subject to sample selection bias: The sample
is selected based on the value of the dependent
variable, returns, because funds with the lowest returns
are eliminated. The mean return of all funds (including
the defunct) over a ten-year period will be less
than the mean return of those funds still in existence
at the end of those ten years, so a study of only the
latter funds will overstate performance. Financial
economists refer to this selection bias as survivorship
bias because only the better funds survive to be
in the data set.

When financial econometricians correct for
survivorship bias by incorporating data on defunct
funds, the results do not paint a flattering portrait
of mutual fund managers. Corrected for survivorship
bias, the econometric evidence indicates that
actively managed stock mutual funds do not outperform
the market, on average, and that past good
performance does not predict future good performance.
For further reading on mutual funds and survivorship
bias, see Malkiel (2016), Chapter 7, and Carhart
(1997). The problem of survivorship bias also arises
in evaluating hedge fund performance; for further
reading, see Aggarwal and Jorion (2010).


Simultaneous Causality


So far, we have assumed that causality runs from the regressors to the dependent vari- 
able (X causes Y). But what if causality also runs from the dependent variable to one 
or more regressors (Y causes X)? If so, causality runs “backward” as well as forward; 
that is, there is simultaneous causality. If there is simultaneous causality, an OLS 
regression picks up both effects, so the OLS estimator is biased and inconsistent. 
For example, our study of test scores focused on the effect on test scores of 
reducing the student-teacher ratio, so causality is presumed to run from the student- 
teacher ratio to test scores. Suppose, however, a government initiative subsidized 
hiring teachers in school districts with poor test scores. If so, causality would run in 
both directions: For the usual educational reasons, low student-teacher ratios would
arguably lead to high test scores, but because of the government program, low test
scores would lead to low student-teacher ratios. 

Simultaneous causality leads to correlation between the regressor and the error 
term. In the test score example, suppose there is an omitted factor that leads to poor 
test scores; because of the government program, this factor that produces low scores 
in turn results in a low student-teacher ratio. Thus a negative error term in the popu- 
lation regression of test scores on the student-teacher ratio reduces test scores, but 
because of the government program, it also leads to a decrease in the student-teacher 
ratio. In other words, the student-teacher ratio is positively correlated with the error 
term in the population regression. This in turn leads to simultaneous causality bias 
and inconsistency of the OLS estimator. 

This correlation between the error term and the regressor can be made mathe- 
matically precise by introducing an additional equation that describes the reverse 
causal link. For convenience, consider just the two variables X and Y, and ignore 
other possible regressors. Accordingly, there are two equations, one in which X causes 
Y and one in which Y causes X: 


$$Y_i = \beta_0 + \beta_1 X_i + u_i \quad \text{and} \tag{9.3}$$

$$X_i = \gamma_0 + \gamma_1 Y_i + v_i. \tag{9.4}$$


Equation (9.3) is the familiar one in which $\beta_1$ is the effect on Y of a change in X,
where u represents other factors. Equation (9.4) represents the reverse causal effect
of Y on X. In the test score problem, Equation (9.3) represents the educational
effect of class size on test scores, while Equation (9.4) represents the reverse causal
effect of test scores on class size induced by the government program.

Simultaneous causality leads to correlation between $X_i$ and the error term $u_i$ in
Equation (9.3). To see this, imagine that $u_i$ is positive, which increases $Y_i$. However,
this higher value of $Y_i$ affects the value of $X_i$ through the second of these equations,
and if $\gamma_1$ is positive, a high value of $Y_i$ will lead to a high value of $X_i$. In
general, if $\gamma_1$ is nonzero, $X_i$ and $u_i$ will be correlated.⁴

Because it can be expressed mathematically using two simultaneous equations, 
simultaneous causality bias is sometimes called simultaneous equations bias. Simul- 
taneous causality bias is summarized in Key Concept 9.6. 
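
The correlation between the regressor and the error term can be checked numerically. The sketch below solves Equations (9.3) and (9.4) jointly for X, simulates data, and compares the sample covariance of X and u with $\gamma_1 \sigma_u^2 / (1 - \gamma_1 \beta_1)$. The parameter values, the unit error variances, and the function name `simultaneity_demo` are assumptions made for illustration.

```python
import numpy as np

def simultaneity_demo(beta0=1.0, beta1=-0.5, gamma0=2.0, gamma1=0.4,
                      n=200_000, seed=0):
    """Return (sample cov(X, u), cov implied by the two equations, OLS slope)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)    # error in Equation (9.3), variance 1
    v = rng.normal(size=n)    # error in Equation (9.4), independent of u
    # Substitute (9.3) into (9.4) and solve for X, then recover Y from (9.3)
    x = (gamma0 + gamma1 * beta0 + gamma1 * u + v) / (1.0 - gamma1 * beta1)
    y = beta0 + beta1 * x + u
    cov_xu = np.cov(x, u)[0, 1]
    cov_implied = gamma1 * 1.0 / (1.0 - gamma1 * beta1)   # gamma1 * var(u) / (1 - gamma1*beta1)
    ols_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return cov_xu, cov_implied, ols_slope

if __name__ == "__main__":
    cov_xu, cov_implied, ols_slope = simultaneity_demo()
    print(f"cov(X, u): {cov_xu:.3f} (implied {cov_implied:.3f}), OLS slope: {ols_slope:.3f}")
```

Even with a very large sample, the OLS slope stays far from the value of $\beta_1$ used to generate the data, which is what inconsistency means in practice.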


Solutions to simultaneous causality bias. There are two ways to mitigate simultaneous 
causality bias. One is to use instrumental variables regression, the topic of Chapter 12. 
The second is to design and implement a randomized controlled experiment in which the 
reverse causality channel is nullified, and such experiments are discussed in Chapter 13. 


⁴To show this mathematically, note that Equation (9.4) implies that
$\mathrm{cov}(X_i, u_i) = \mathrm{cov}(\gamma_0 + \gamma_1 Y_i + v_i, u_i) =
\gamma_1 \mathrm{cov}(Y_i, u_i) + \mathrm{cov}(v_i, u_i)$. Assuming that
$\mathrm{cov}(v_i, u_i) = 0$, by Equation (9.3) this in turn implies that
$\mathrm{cov}(X_i, u_i) = \gamma_1 \mathrm{cov}(\beta_0 + \beta_1 X_i + u_i, u_i) =
\gamma_1 \beta_1 \mathrm{cov}(X_i, u_i) + \gamma_1 \sigma_u^2$. Solving for
$\mathrm{cov}(X_i, u_i)$ then yields the result
$\mathrm{cov}(X_i, u_i) = \gamma_1 \sigma_u^2 / (1 - \gamma_1 \beta_1)$.


Key Concept 9.6
Simultaneous Causality Bias


Simultaneous causality bias, also called simultaneous equations bias, arises in a 
regression of Y on X when, in addition to the causal link of interest from X to Y, 
there is a causal link from Y to X. This reverse causality makes X correlated with 
the error term in the population regression of interest. 


Sources of Inconsistency of OLS Standard Errors 


Inconsistent standard errors pose a different threat to internal validity. Even if the OLS 
estimator is consistent and the sample is large, inconsistent standard errors will produce 
hypothesis tests with size that differs from the desired significance level and “95%” 
confidence intervals that fail to include the true value in 95% of repeated samples. 
There are two main reasons for inconsistent standard errors: improperly handled 
heteroskedasticity and correlation of the error term across observations. 


Heteroskedasticity. As discussed in Section 5.4, for historical reasons, some regres- 
sion software reports homoskedasticity-only standard errors. If, however, the regres- 
sion error is heteroskedastic, those standard errors are not a reliable basis for 
hypothesis tests and confidence intervals. The solution to this problem is to use 
heteroskedasticity-robust standard errors and to construct F-statistics using a 
heteroskedasticity-robust variance estimator. Heteroskedasticity-robust standard 
errors are provided as an option in modern software packages. 
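
To make the difference concrete, the sketch below computes both kinds of standard errors directly with NumPy on simulated heteroskedastic data. The data-generating process and function names are assumptions, and the robust formula uses the common HC1 degrees-of-freedom adjustment, which may differ slightly from the convention in a given software package.

```python
import numpy as np

def standard_errors(y, X):
    """Return (homoskedasticity-only SEs, heteroskedasticity-robust HC1 SEs)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ (X.T @ y)
    resid = y - X @ beta
    s2 = resid @ resid / (n - k)                    # homoskedastic error variance
    se_homo = np.sqrt(np.diag(s2 * xtx_inv))
    meat = (X * resid[:, None] ** 2).T @ X          # sum of x_i x_i' u_hat_i^2
    cov_robust = xtx_inv @ meat @ xtx_inv * n / (n - k)
    return se_homo, np.sqrt(np.diag(cov_robust))

def heteroskedastic_example(n=5000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    u = np.abs(x) * rng.normal(size=n)   # error spread grows with |x| (hypothetical)
    y = 1.0 + 2.0 * x + u
    X = np.column_stack([np.ones(n), x])
    return standard_errors(y, X)

if __name__ == "__main__":
    se_homo, se_robust = heteroskedastic_example()
    print(f"slope SE, homoskedasticity-only: {se_homo[1]:.4f}")
    print(f"slope SE, robust (HC1):          {se_robust[1]:.4f}")
```

In this design the homoskedasticity-only standard error on the slope is much too small, so a nominal 95% confidence interval built from it would cover the true value less often than 95% of the time.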


Correlation of the error term across observations. In some settings, the population 
regression error can be correlated across observations. This will not happen if the data 
are obtained by sampling at random from the population because the randomness of 
the sampling process ensures that the errors are independently distributed from one 
observation to the next. Sometimes, however, sampling is only partially random. The 
most common circumstance is when the data are repeated observations on the same 
entity over time, such as the same school district for different years. If the omitted 
variables that constitute the regression error are persistent (like district demograph- 
ics), “serial” correlation is induced in the regression error over time. Serial correlation 
in the error term can arise in panel data (e.g., data on multiple districts for multiple 
years) and in time series data (e.g., data on a single district for multiple years). 

Another situation in which the error term can be correlated across observations 
is when sampling is based on a geographical unit. If there are omitted variables that 
reflect geographic influences, these omitted variables could result in correlation of 
the regression errors for adjacent observations. 

Key Concept 9.7
Threats to the Internal Validity of a Multiple Regression Study

There are five primary threats to the internal validity of a multiple regression
study:

1. Omitted variables
2. Functional form misspecification
3. Errors in variables (measurement error in the regressors)
4. Sample selection
5. Simultaneous causality.

Each of these, if present, results in failure of the first least squares assumption in
Key Concept 6.4 (or, if there are control variables, in Key Concept 6.6), which in
turn means that the OLS estimator is biased and inconsistent.

Incorrect calculation of the standard errors also poses a threat to internal
validity. Homoskedasticity-only standard errors are invalid if heteroskedasticity
is present. If the variables are not independent across observations, as can arise
in panel and time series data, then a further adjustment to the standard error
formula is needed to obtain valid standard errors.

Applying this list of threats to a multiple regression study provides a systematic
way to assess the internal validity of that study.


Correlation of the regression error across observations does not make the OLS
estimator biased or inconsistent, but it does violate the second least squares
assumption in Key Concept 6.4. The consequence is that the OLS standard errors,
both homoskedasticity-only and heteroskedasticity-robust, are incorrect in the
sense that they do not produce confidence intervals with the desired confidence level.

In many cases, this problem can be fixed by using an alternative formula for 
standard errors. We provide formulas for computing standard errors that are robust 
to both heteroskedasticity and serial correlation in Chapter 10 (regression with panel 
data) and in Chapter 16 (regression with time series data). 
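
As a preview of what such an alternative formula can look like, the sketch below compares heteroskedasticity-robust standard errors with clustered standard errors on a simulated panel. The clustered formula shown is one widely used choice and is not taken from this section; the simulated panel, the absence of a small-sample correction in the clustered estimator, and the function names are assumptions.

```python
import numpy as np

def hc1_and_clustered_se(y, X, groups):
    """Return (HC1 SEs, clustered SEs) for an OLS regression of y on X."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ (X.T @ y)
    resid = y - X @ beta
    scores = X * resid[:, None]                  # rows are x_i * u_hat_i
    # HC1 treats the observations as uncorrelated with one another
    cov_hc = xtx_inv @ (scores.T @ scores) @ xtx_inv * n / (n - k)
    # Clustered: add up the scores within each entity before squaring, which
    # allows arbitrary correlation of the errors within an entity
    labels, idx = np.unique(groups, return_inverse=True)
    sums = np.zeros((labels.size, k))
    np.add.at(sums, idx, scores)
    cov_cl = xtx_inv @ (sums.T @ sums) @ xtx_inv
    return np.sqrt(np.diag(cov_hc)), np.sqrt(np.diag(cov_cl))

def panel_example(n_entities=200, n_periods=10, seed=0):
    rng = np.random.default_rng(seed)
    groups = np.repeat(np.arange(n_entities), n_periods)
    x = rng.normal(size=n_entities)[groups]            # regressor fixed within entity
    persistent = rng.normal(size=n_entities)[groups]   # persistent omitted factor
    u = persistent + rng.normal(size=groups.size)      # serially correlated error
    y = 1.0 + 2.0 * x + u
    X = np.column_stack([np.ones(groups.size), x])
    return hc1_and_clustered_se(y, X, groups)

if __name__ == "__main__":
    se_hc, se_cl = panel_example()
    print(f"slope SE, HC1: {se_hc[1]:.4f}   clustered: {se_cl[1]:.4f}")
```

Because the persistent factor is uncorrelated with the regressor, the OLS slope itself is fine here; only the standard error that ignores the within-entity correlation is misleading.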

Key Concept 9.7 summarizes the threats to internal validity of a multiple regres- 
sion study. 


Internal and External Validity When 
the Regression Is Used for Prediction 


When regression models are used for prediction, concerns about external validity are 
very important, but concerns about unbiased estimation of causal effects are not. 
Chapter 4 began by considering two problems. A school superintendent wants 
to know how much test scores will increase if she reduces class sizes in her school 
district; that is, the superintendent wants to know the causal effect on test scores of
a change in class size. A father, considering moving to a school district for which test
scores are not publicly available, wants a reliable prediction about test scores in that 
district, based on data to which he has access. The father does not need to know the 
causal effect on test scores of class size—or, for that matter, of any variable. What 
matters to him is that the prediction equation estimated using the California district- 
level data provides an accurate and reliable prediction of test scores for the district 
to which the father is considering moving. 

Reliable prediction using multiple regression has three requirements. The first 
requirement is that the data used to estimate the prediction model and the obser- 
vation for which the prediction is to be made are drawn from the same distribu- 
tion. This requirement is formalized as the first least squares assumption for 
prediction, given in Appendix 6.4 for the case of multiple predictors. If the estima- 
tion and prediction observations are drawn from the same population, then the 
estimated conditional expectation of Y given X generalizes to the out-of-sample 
observation to be predicted. This requirement is a mathematical statement of 
external validity in the prediction context. In the test score example, if the esti- 
mated regression line is useful for other districts in California, it could well be 
useful for elementary school districts in other states, but it is unlikely to be useful 
for colleges. 

The second requirement involves the list of predictors. When the aim is to esti- 
mate a causal effect, it is important to choose control variables to reduce the threat 
of omitted variable bias. In contrast, for prediction the aim is to have an accurate 
out-of-sample forecast. For this purpose, the predictors should be ones that substan- 
tially contribute to explaining the variation in Y, whether or not they have any causal 
interpretation. The question of choice of predictor is further complicated when there 
are time series data, for then there is the opportunity to exploit correlation over time 
(serial correlation) to make forecasts—that is, predictions of future values of 
variables. The use of multiple regression for time series forecasting is taken up in 
Chapters 15 and 17. 

The third requirement concerns the estimator itself. So far, we have focused on 
OLS for estimating multiple regression. In some prediction applications, however, 
there are very many predictors; indeed, in some applications the number of predic- 
tors can exceed the sample size. If there are very many predictors, then there are— 
surprisingly—some estimators that can provide more accurate out-of-sample 
predictions than OLS. Chapter 14 takes up prediction with many predictors and 
explains these specialized estimators. 
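
The third requirement can be illustrated with a simulation in which the number of predictors is close to the sample size. The sketch compares OLS with ridge regression, which shrinks the coefficients toward 0. Ridge is used here only as one example of an alternative estimator and is not named in this section; the data-generating process, the penalty value `lam`, and the function name are assumptions.

```python
import numpy as np

def out_of_sample_mspe(n_train=100, n_test=2000, k=80, lam=100.0, seed=0):
    """Return (OLS, ridge) mean squared prediction errors on fresh data.
    No intercept is fit because all simulated variables have mean 0."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(scale=0.1, size=k)        # many small true coefficients
    X_train = rng.normal(size=(n_train, k))
    X_test = rng.normal(size=(n_test, k))
    y_train = X_train @ beta + rng.normal(size=n_train)
    y_test = X_test @ beta + rng.normal(size=n_test)

    b_ols = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
    # Ridge: minimize the sum of squared residuals plus lam * (sum of squared coefficients)
    b_ridge = np.linalg.solve(X_train.T @ X_train + lam * np.eye(k),
                              X_train.T @ y_train)
    mspe_ols = np.mean((y_test - X_test @ b_ols) ** 2)
    mspe_ridge = np.mean((y_test - X_test @ b_ridge) ** 2)
    return mspe_ols, mspe_ridge

if __name__ == "__main__":
    mspe_ols, mspe_ridge = out_of_sample_mspe()
    print(f"out-of-sample MSPE, OLS: {mspe_ols:.2f}   ridge: {mspe_ridge:.2f}")
```

With 80 predictors and 100 observations, OLS fits the estimation sample closely but predicts new observations poorly, while the shrinkage estimator gives up some in-sample fit for better out-of-sample accuracy.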


Example: Test Scores and Class Size 


The framework of internal and external validity helps us to take a critical look at 
what we have learned—and what we have not—from our analysis of the California 
test score data.

Summary Statistics for California and Massachusetts Test Score Data Sets

                                 California                      Massachusetts
                                 Average   Standard Deviation    Average   Standard Deviation
Test scores                      654.1     19.1                  709.8     15.1
Student-teacher ratio            19.6      1.9                   17.3      2.3
% English learners               15.8%     18.3%                 1.1%      2.9%
% receiving subsidized lunch     44.7%     27.1%                 15.3%     15.1%
Average district income ($)      $15,317   $7226                 $18,747   $5808
Number of observations           420                             220
Year                             1999                            1998


External Validity


Whether the California analysis can be generalized (that is, whether it is externally
valid) depends on the population and setting to which the generalization is made.
Here, we consider whether the results can be generalized to performance on other 
standardized tests in other elementary public school districts in the United States. 

Section 9.1 noted that having more than one study on the same topic provides 
an opportunity to assess the external validity of both studies by comparing their 
results. In the case of test scores and class size, other comparable data sets are, in fact, 
available. In this section, we examine a different data set, based on standardized test 
results for fourth graders in 220 public school districts in Massachusetts in 1998. Both 
the Massachusetts and California tests are broad measures of student knowledge and 
academic skills, although the details differ. Similarly, the organization of classroom 
instruction is broadly similar at the elementary school level in the two states (as it is 
in most U.S. elementary school districts), although aspects of elementary school 
funding and curriculum differ. Thus finding similar results about the effect of the 
student-teacher ratio on test performance in the California and Massachusetts data 
would be evidence of external validity of the findings in California. Conversely, find- 
ing different results in the two states would raise questions about the internal or 
external validity of at least one of the studies. 


Comparison of the California and Massachusetts data. Like the California data, 
the Massachusetts data are at the school district level. The definitions of the variables 
in the Massachusetts data set are the same as those in the California data set, or 
nearly so. More information on the Massachusetts data set, including definitions of 
the variables, is given in Appendix 9.1. 

Table 9.1 presents summary statistics for the California and Massachusetts sam- 
ples. The average test score is higher in Massachusetts, but the test is different, so a 
direct comparison of scores is not appropriate. The average student-teacher ratio is
higher in California than in Massachusetts (19.6 versus 17.3). Average district income
is 20% higher in Massachusetts, but the standard deviation of district income is 
greater in California; that is, there is a greater spread in average district income in 
California than in Massachusetts. The average percentage of students still learning 
English and the average percentage of students receiving subsidized lunches are both 
much higher in the California districts than in the Massachusetts districts. 


Test scores and average district income. To save space, we do not present scat- 
terplots of all the Massachusetts data. Because it was a focus in Chapter 8, how- 
ever, it is interesting to examine the relationship between test scores and average 
district income in Massachusetts. This scatterplot is presented in Figure 9.1. The 
general pattern of this scatterplot is similar to that in Figure 8.2 for the California 
data: The relationship between district income and test scores appears to be steep 
for low values of income and flatter for high values. Evidently, the linear regres- 
sion plotted in the figure misses this apparent nonlinearity. Cubic and logarithmic 
regression functions are also plotted in Figure 9.1. The cubic regression function 
has a slightly higher $\bar{R}^2$ than the logarithmic specification (0.486 versus 0.455).
Comparing Figures 8.7 and 9.1 shows that the general pattern of nonlinearity 
found in the California district income and test score data is also present in the 
Massachusetts data. The precise functional forms that best describe this
nonlinearity differ, however, with the cubic specification fitting best in Massachusetts
but the linear-log specification fitting best in California.


FIGURE 9.1  Test Scores vs. District Income for Massachusetts Data


Multiple regression results. Regression results for the Massachusetts data are pre- 
sented in Table 9.2. The first regression, reported in column (1) in the table, has only 
the student-teacher ratio as a regressor. The slope is negative (-1.72), and the
hypothesis that the coefficient is 0 can be rejected at the 1% significance level
(t = -1.72/0.50 = -3.44).

The remaining columns report the results of including additional variables 
that control for student characteristics and of introducing nonlinearities into the 
estimated regression function. Controlling for the percentage of English learners, 
the percentage of students eligible for a subsidized lunch, and the average district 
income reduces the estimated coefficient on the student-teacher ratio by 60%, 
from -1.72 in regression (1) to -0.69 in regression (2) and -0.64 in regression (3).

Comparing the $\bar{R}^2$'s of regressions (2) and (3) indicates that the cubic specification
(3) provides a better model of the relationship between test scores and district
income than does the logarithmic specification (2), even holding constant the 
student-teacher ratio. There is no statistically significant evidence of a nonlinear 
relationship between test scores and the student-teacher ratio: The F-statistic in 
regression (4) testing whether the population coefficients on $STR^2$ and $STR^3$ are 0
has a p-value of 0.641. The estimates in regression (5) suggest that a class size reduction
is less effective when there are many English learners, the opposite finding from
the California data; however, as in the California data, this interaction effect is
imprecisely estimated and is not statistically significant at the 10% level [the
t-statistic on HiEL × STR in regression (5) is 0.80/0.56 = 1.43]. Finally, regression
(6) shows that the estimated coefficient on the student-teacher ratio does not 
change substantially when the percentage of English learners [which is insignificant 
in regression (3)] is excluded. In short, the results in regression (3) are not sensitive 
to the changes in functional form and specification considered in regressions (4) 
through (6) in Table 9.2. Therefore, we adopt regression (3) as our base estimate of 
the effect on test scores of a change in the student-teacher ratio based on the Mas- 
sachusetts data. 
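
As a rough check on the inference described above, the t-statistics can be recomputed from the coefficients and standard errors printed in Table 9.2. The sketch below is illustrative only: the function names are hypothetical, and the p-values rely on the large-sample standard normal approximation; they are not output from the original estimation.

```python
import math

def t_statistic(coef, se):
    """t-statistic for the null hypothesis that the population coefficient is 0."""
    return coef / se

def two_sided_p_value(t):
    """Two-sided p-value under the large-sample standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))

# (coefficient, standard error) pairs copied from Table 9.2
estimates = {
    "STR, regression (1)": (-1.72, 0.50),
    "STR, regression (3)": (-0.64, 0.27),
    "HiEL x STR, regression (5)": (0.80, 0.56),
}

results = {}
for name, (coef, se) in estimates.items():
    t = t_statistic(coef, se)
    results[name] = (t, two_sided_p_value(t))
    print(f"{name}: t = {t:.2f}, p = {results[name][1]:.3f}")
```

Under this approximation, the coefficient in regression (1) is significant at the 1% level, the coefficient in regression (3) at the 5% level but not the 1% level, and the interaction term is not significant at the 10% level.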


Comparison of Massachusetts and California results. For the California data, we 
found the following: 


1. Adding variables that control for student background characteristics reduced 
the coefficient on the student-teacher ratio from -2.28 [Table 7.1, regression
(1)] to -0.73 [Table 8.3, regression (2)], a reduction of 68%.

2. The hypothesis that the true coefficient on the student-teacher ratio is 0 was 
rejected at the 1% significance level, even after adding variables that control 
for student background and district economic characteristics. 

3. The effect of cutting the student-teacher ratio did not depend in a statistically
significant way on the percentage of English learners in the district.

4. There is some evidence that the relationship between test scores and the
student-teacher ratio is nonlinear.


TABLE 9.2  Multiple Regression Estimates of the Student-Teacher Ratio and Test Scores:
           Data from Massachusetts

Dependent variable: average combined English, math, and science test score in the school district, fourth grade;
220 observations.

Regressor                         (1)             (2)             (3)             (4)             (5)             (6)
Student-teacher ratio (STR)       -1.72           -0.69           -0.64           12.4            -1.02           -0.67
                                  (0.50)          (0.27)          (0.27)          (14.0)          (0.37)          (0.27)
                                  [-2.70, -0.73]  [-1.22, -0.16]  [-1.17, -0.11]                                  [-1.21, -0.14]
$STR^2$                                                                           -0.680
                                                                                  (0.737)
$STR^3$                                                                           0.011
                                                                                  (0.013)
% English learners                                -0.411          -0.437          -0.434
                                                  (0.306)         (0.303)         (0.300)
% English learners > median?                                                                      -12.6
  (Binary, HiEL)                                                                                  (9.8)
HiEL × STR                                                                                        0.80
                                                                                                  (0.56)
% eligible for free lunch                         -0.521          -0.582          -0.587          -0.709          -0.653
                                                  (0.077)         (0.097)         (0.104)         (0.091)         (0.72)
District income (logarithm)                       16.53
                                                  (3.15)
District income                                                   -3.07           -3.38           -3.87           -3.22
                                                                  (2.35)          (2.49)          (2.49)          (2.31)
District income$^2$                                               0.164           0.174           0.184           0.165
                                                                  (0.085)         (0.089)         (0.090)         (0.085)
District income$^3$                                               -0.0022         -0.0023         -0.0023         -0.0022
                                                                  (0.0010)        (0.0010)        (0.0010)        (0.0010)

F-Statistics and p-Values Testing Exclusion of Groups of Variables

All STR variables and                                                             2.86            4.01
  interactions = 0                                                                (0.038)         (0.020)
$STR^2$, $STR^3$ = 0                                                              0.45
                                                                                  (0.641)
Income$^2$, Income$^3$                                            7.14            [illegible]     5.85            6.55
                                                                  (< 0.001)       (< 0.001)       (0.003)         (0.002)
HiEL, HiEL × STR                                                                                  1.58
                                                                                                  (0.208)

SER                               14.64           8.69            8.61            8.63            8.62            8.64
$\bar{R}^2$                       0.063           0.670           0.676           0.675           0.675           0.674

These regressions were estimated using the data on Massachusetts elementary school districts described in Appendix 9.1.
All regressions include an intercept (not reported). Standard errors are given in parentheses under the coefficients, and
p-values are given in parentheses under the F-statistics. 95% confidence intervals for the coefficient on the student-teacher
ratio are presented in brackets for regressions (1), (2), (3), and (6), but not for the regressions with nonlinear terms in STR.


Do we find the same things in Massachusetts? For findings (1), (2), and (3), the 
answer is yes. Including the additional control variables reduces the coefficient on 
the student-teacher ratio from -1.72 [Table 9.2, regression (1)] to -0.69 [Table 9.2,
regression (2)], a reduction of 60%. The coefficients on the student-teacher ratio 
remain significant after adding the control variables. Those coefficients are signifi- 
cant only at the 5% level in the Massachusetts data, whereas they are significant at 
the 1% level in the California data. However, there are nearly twice as many obser- 
vations in the California data, so it is not surprising that the California estimates 
are more precise. As in the California data, there is no statistically significant evi- 
dence in the Massachusetts data of an interaction between the student-teacher 
ratio and the binary variable indicating a large percentage of English learners in 
the district. 
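
The percentage reductions quoted for findings (1) in the two states follow directly from the reported coefficients. The short sketch below reproduces that arithmetic; the function name is hypothetical and the inputs are the rounded values printed in the tables.

```python
def percent_reduction(before, after):
    """Percentage fall in the magnitude of a coefficient after adding controls."""
    return 100.0 * (abs(before) - abs(after)) / abs(before)

# California: Table 7.1 regression (1) versus Table 8.3 regression (2)
california = percent_reduction(-2.28, -0.73)
# Massachusetts: Table 9.2 regression (1) versus regression (2)
massachusetts = percent_reduction(-1.72, -0.69)

print(f"California: {california:.0f}%  Massachusetts: {massachusetts:.0f}%")
```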

Finding (4), however, does not hold up in the Massachusetts data: The hypothesis 
that the relationship between the student-teacher ratio and test scores is linear can- 
not be rejected at the 5% significance level when tested against a cubic 
specification. 

Because the two standardized tests are different, the coefficients themselves can- 
not be compared directly: One point on the Massachusetts test is not the same as one 
point on the California test. If, however, the test scores are put into the same units, 
then the estimated class size effects can be compared. One way to do this is to trans- 
form the test scores by standardizing them: Subtract the sample average and divide 
by the standard deviation so that they have a mean of 0 and a variance of 1. The slope 
coefficients in the regression with the standardized test score equal the slope coef- 
ficients in the original regression divided by the standard deviation of the test. Thus 
the coefficient on the student-teacher ratio divided by the standard deviation of test 
scores can be compared across the two data sets. 
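
The claim that standardizing the dependent variable simply rescales the slope by the standard deviation of the test can be checked numerically. The sketch below uses simulated data with a single regressor; the data-generating values (a mean class size of 19.6, a slope of -2.3, and so on) are assumptions chosen for illustration and are not the California or Massachusetts data.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope from an OLS regression of y on x and an intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
n = 420
# Hypothetical district data: student-teacher ratio and test score
str_ratio = [random.gauss(19.6, 1.9) for _ in range(n)]
score = [700 - 2.3 * s + random.gauss(0, 18) for s in str_ratio]

# Standardize the test score: subtract the sample mean, divide by the sample SD
mean_y = statistics.fmean(score)
sd_y = statistics.stdev(score)
score_std = [(y - mean_y) / sd_y for y in score]

b_raw = ols_slope(str_ratio, score)       # slope in test-score points
b_std = ols_slope(str_ratio, score_std)   # slope in standard deviation units

print(f"raw slope / SD = {b_raw / sd_y:.6f}, standardized slope = {b_std:.6f}")
```

The two printed numbers coincide up to floating-point error, which is why dividing each state's coefficient by that state's test score standard deviation puts the estimates on a common scale.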

This comparison is undertaken in Table 9.3. The first column reports the OLS 
estimates of the coefficient on the student-teacher ratio in a regression with the 
percentage of English learners, the percentage of students eligible for a subsidized 
lunch, and the average district income included as control variables. The second 
column reports the standard deviation of the test scores across districts. The final 
two columns report the estimated effect on test scores of reducing the student-teacher 
ratio by two students per teacher (our superintendent’s proposal), first in the 
units of the test and second in standard deviation units. For the linear specification,
the OLS coefficient estimate using California data is -0.73, so cutting the
student-teacher ratio by two is estimated to increase district test scores by
-0.73 × (-2) = 1.46 points. Because the standard deviation of test scores is 19.1
points, this corresponds to 1.46/19.1 = 0.076 standard deviation units of the
distribution of test scores across districts. The standard error of this estimate is
0.26 × 2/19.1 = 0.027. The estimated effects for the nonlinear models and their
standard errors were computed using the method described in Section 8.1.


TABLE 9.3  Student-Teacher Ratios and Test Scores: Comparing the Estimates from
           California and Massachusetts

                                                                           Estimated Effect of Two Fewer
                                                                           Students per Teacher, in Units of:
                                OLS Estimate           Standard Deviation
                                $\hat{\beta}_{STR}$    of Test Scores      Points on         Standard
                                                       Across Districts    the Test          Deviations
California
  Linear: Table 8.3(2)          -0.73                  19.1                1.46              0.076
                                (0.26)                                     (0.52)            (0.027)
                                                                           [0.46, 2.48]      [0.024, 0.130]
  Cubic: Table 8.3(7)           --                     19.1                2.93              0.153
  Reduce STR from 20 to 18                                                 (0.70)            (0.037)
                                                                           [1.56, 4.30]      [0.081, 0.226]
  Cubic: Table 8.3(7)           --                     19.1                1.90              0.099
  Reduce STR from 22 to 20                                                 (0.69)            (0.036)
                                                                           [0.54, 3.26]      [0.028, 0.171]
Massachusetts
  Linear: Table 9.2(3)          -0.64                  15.1                1.28              0.085
                                (0.27)                                     (0.54)            (0.036)
                                                                           [0.22, 2.34]      [0.015, 0.154]

Standard errors are given in parentheses. 95% confidence intervals for the effect of a two-student reduction are given in
brackets.

Based on the linear model using California data, a reduction of two students per 
teacher is estimated to increase test scores by 0.076 standard deviation units, with a 
standard error of 0.027. The nonlinear models for California data suggest a somewhat
larger effect, with the specific effect depending on the initial student-teacher ratio. 
Based on the Massachusetts data, this estimated effect is 0.085 standard deviation 
units, with a standard error of 0.036. 
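
The linear-specification entries of Table 9.3 can be approximately reproduced from the reported coefficients, standard errors, and test score standard deviations. The sketch below is a hedged illustration: the function name and the dictionary layout are hypothetical, and because the inputs are the rounded values printed in the tables, the confidence limits can differ from the published ones in the last digit.

```python
def two_student_effect(beta_str, se_beta, sd_scores, change=-2.0, z=1.96):
    """Estimated effect of changing STR by `change`, in test points and in SD units."""
    effect = beta_str * change
    se_effect = se_beta * abs(change)

    def summarize(scale):
        est, se = effect / scale, se_effect / scale
        return {"estimate": est, "se": se, "ci": (est - z * se, est + z * se)}

    return {"points": summarize(1.0), "sd_units": summarize(sd_scores)}

# Inputs copied from Table 8.3(2) / Table 9.1 (California) and Table 9.2(3) / Table 9.1 (Massachusetts)
california = two_student_effect(-0.73, 0.26, 19.1)
massachusetts = two_student_effect(-0.64, 0.27, 15.1)

for name, res in (("California", california), ("Massachusetts", massachusetts)):
    sd = res["sd_units"]
    print(f"{name}: {res['points']['estimate']:.2f} points, "
          f"{sd['estimate']:.3f} SD (SE {sd['se']:.3f}), "
          f"95% CI [{sd['ci'][0]:.3f}, {sd['ci'][1]:.3f}]")
```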

These estimates are essentially the same. The 95% confidence interval for Mas- 
sachusetts contains the 95% confidence interval for the California linear specifica- 
tion. Cutting the student-teacher ratio is predicted to raise test scores, but the 
predicted improvement is small. In the California data, for example, the difference 
in test scores between the median district and a district at the 75th percentile is 12.2 
test score points (Table 4.1), or 0.64 ( = 12.2/19.1) standard deviation units. The esti- 
mated effect from the linear model is just over one-tenth this size; in other words, 
according to this estimate, cutting the student-teacher ratio by two would move a
district only one-tenth of the way from the median to the 75th percentile of the distribution
of test scores across districts. Reducing the student-teacher ratio by two is
a large change for a district, but the estimated benefits shown in Table 9.3, while 
nonzero, are small. 

This analysis of Massachusetts data suggests that the California results are exter- 
nally valid, at least when generalized to elementary school districts elsewhere in the 
United States. 


Internal Validity 


The similarity of the results for California and Massachusetts does not ensure their 
internal validity. Section 9.2 listed five possible threats to internal validity that could 
induce bias in the estimated effect on test scores of class size. We consider these 
threats in turn. 


Omitted variables. The multiple regressions reported in this and previous chapters 
control for a student characteristic (the percentage of English learners), a family 
economic characteristic (the percentage of students receiving a subsidized lunch), 
and a broader measure of the affluence of the district (average district income). 

If these control variables are adequate, then for the purpose of regression analy- 
sis it is as if the student-teacher ratio is randomly assigned among districts with the 
same values of these control variables, in which case the conditional mean indepen- 
dence assumption holds. There still could be, however, some omitted factors for 
which these three variables might not be adequate controls. For example, if the 
student-teacher ratio is correlated with teacher quality even among districts with the 
same fraction of immigrants and the same socioeconomic characteristics (perhaps 
because better teachers are attracted to schools with smaller student-teacher ratios) 
and if teacher quality affects test scores, then omission of teacher quality could bias 
the coefficient on the student-teacher ratio. Similarly, among districts with the same 
socioeconomic characteristics, districts with a low student-teacher ratio might have 
families that are more committed to enhancing their children’s learning at home. 
Such omitted factors could lead to omitted variable bias. 
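
The teacher-quality example can be illustrated with a small simulation. Everything in the sketch below is hypothetical: the variable `quality`, the coefficient values, and the sample size are assumptions chosen to make the mechanism visible, not estimates from either data set. The "long" regression that controls for quality is computed by partialling quality out of both variables (the Frisch-Waugh approach).

```python
import random
import statistics

def ols_slope(x, y):
    """Slope from an OLS regression of y on x and an intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residualize(y, x):
    """Residuals from an OLS regression of y on x and an intercept."""
    b = ols_slope(x, y)
    a = statistics.fmean(y) - b * statistics.fmean(x)
    return [yi - a - b * xi for xi, yi in zip(x, y)]

random.seed(1)
n = 20_000
# Assumed setup: better teachers are drawn to districts with smaller classes
quality = [random.gauss(0, 1) for _ in range(n)]
str_ratio = [20 - 1.0 * q + random.gauss(0, 1.5) for q in quality]
score = [650 - 0.7 * s + 5.0 * q + random.gauss(0, 8)
         for s, q in zip(str_ratio, quality)]

# Short regression: teacher quality omitted
b_short = ols_slope(str_ratio, score)
# Long regression: coefficient on STR holding quality constant
b_long = ols_slope(residualize(str_ratio, quality), residualize(score, quality))
# Large-sample value of the short slope: beta_1 + beta_2 * cov(STR, quality) / var(STR)
implied_short = -0.7 + 5.0 * (-1.0) / (1.0 + 1.5 ** 2)

print(f"short: {b_short:.2f}, long: {b_long:.2f}, implied short: {implied_short:.2f}")
```

In this constructed example the short regression overstates the magnitude of the class size effect, because part of the effect of teacher quality is attributed to the student-teacher ratio.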

One way to eliminate omitted variable bias, at least in theory, is to conduct an 
experiment. For example, students could be randomly assigned to different size 
classes, and their subsequent performance on standardized tests could be compared. 
Such a study was, in fact, conducted in Tennessee, and we examine it in Chapter 13. 


Functional form. The analysis here and in Chapter 8 explored a variety of functional 
forms. We found that some of the possible nonlinearities investigated were not sta- 
tistically significant, while those that were did not substantially alter the estimated 
effect of reducing the student-teacher ratio. Although further functional form analy- 
sis could be carried out, this suggests that the main findings of these studies are 
unlikely to be sensitive to using different nonlinear regression specifications. 
Errors in variables. The average student-teacher ratio in the district is a broad and
potentially inaccurate measure of class size. For example, because students move in 
and out of districts, the student-teacher ratio might not accurately represent the 
actual class sizes experienced by the students taking the test, which in turn could lead 
to the estimated class size effect being biased toward 0. Another variable with poten- 
tial measurement error is average district income. Those data were taken from the 
1990 Census, while the other data pertain to 1998 (Massachusetts) or 1999 (California). 
If the economic composition of the district changed substantially over the 1990s, this 
would be an imprecise measure of the actual average district income. 
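
The bias toward 0 mentioned above can be seen in a simulation of classical measurement error. The sketch below is illustrative and rests on assumptions not in the text: a true slope of -0.7, a true class size with standard deviation 2, and independent measurement error with standard deviation 2, so that the attenuation factor is 4/(4 + 4) = 0.5.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope from an OLS regression of y on x and an intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(2)
n = 20_000
true_slope = -0.7
# Hypothetical class size actually experienced by the tested students
class_size = [random.gauss(20, 2) for _ in range(n)]
score = [650 + true_slope * c + random.gauss(0, 8) for c in class_size]
# Observed district student-teacher ratio = true class size + independent noise
measured = [c + random.gauss(0, 2) for c in class_size]

b_true = ols_slope(class_size, score)
b_measured = ols_slope(measured, score)
attenuation = 2 ** 2 / (2 ** 2 + 2 ** 2)  # var(X) / (var(X) + var(w))

print(f"true regressor: {b_true:.2f}, mismeasured: {b_measured:.2f}, "
      f"predicted: {true_slope * attenuation:.2f}")
```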


Sample selection. The California and the Massachusetts data cover all the public 
elementary school districts in the state that satisfy minimum size restrictions, so there 
is no reason to believe that sample selection is a problem here. 


Simultaneous causality. Simultaneous causality would arise if the performance on stan- 
dardized tests affected the student-teacher ratio. This could happen, for example, if there 
is a bureaucratic or political mechanism for increasing the funding of poorly performing 
schools or districts that in turn resulted in hiring more teachers. In Massachusetts, no 
such mechanism for equalization of school financing was in place during the time of 
these tests. In California, a series of court cases led to some equalization of funding, but 
this redistribution of funds was not based on student achievement. Thus in neither Mas- 
sachusetts nor California does simultaneous causality appear to be a problem. 
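
Although simultaneous causality does not appear to be a problem here, a simulation shows what it would do if it were. The sketch below assumes a hypothetical two-equation system in deviations from means, Y = beta1*X + u and X = gamma1*Y + v, with parameter values invented for illustration; the feedback coefficient gamma1 > 0 stands for low scores triggering the hiring of more teachers.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope from an OLS regression of y on x and an intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(3)
n = 20_000
beta1 = -0.7    # assumed causal effect of the student-teacher ratio on scores
gamma1 = 0.05   # assumed feedback from scores to the student-teacher ratio
sigma_u, sigma_v = 8.0, 1.5
denom = 1.0 - beta1 * gamma1

x, y = [], []
for _ in range(n):
    u, v = random.gauss(0, sigma_u), random.gauss(0, sigma_v)
    # Reduced form obtained by solving the two equations for X and Y
    x.append((v + gamma1 * u) / denom)
    y.append((beta1 * v + u) / denom)

b_ols = ols_slope(x, y)
# Large-sample value of the OLS slope, cov(X, Y) / var(X)
plim = (beta1 * sigma_v ** 2 + gamma1 * sigma_u ** 2) / (
    sigma_v ** 2 + gamma1 ** 2 * sigma_u ** 2)

print(f"causal effect: {beta1}, OLS slope: {b_ols:.2f}, large-sample value: {plim:.2f}")
```

With these invented parameters even modest feedback is enough to flip the sign of the OLS slope, which is why the institutional details of school financing matter for internal validity.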


Heteroskedasticity and correlation of the error term across observations. All the 
results reported here and in earlier chapters use heteroskedasticity-robust standard
errors, so heteroskedasticity does not threaten internal validity. Correlation of the 
error term across observations, however, could threaten the consistency of the stan- 
dard errors because simple random sampling was not used (the sample consists of all 
elementary school districts in the state). Although there are alternative standard 
error formulas that could be applied to this situation, the details are complicated and 
specialized, and we leave them to more advanced texts. 
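
To see why the choice of standard error formula matters, the sketch below computes both the homoskedasticity-only and the heteroskedasticity-robust standard error of the slope in a single-regressor model. It is a minimal illustration on simulated data whose error variance is assumed to grow with the distance of the regressor from its mean; it does not address correlation of the errors across observations, and the function name is hypothetical.

```python
import math
import random
import statistics

def simple_ols_with_se(x, y):
    """OLS slope with homoskedasticity-only and heteroskedasticity-robust SEs."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    dx = [a - mx for a in x]
    sxx = sum(d * d for d in dx)
    b1 = sum(d * (b - my) for d, b in zip(dx, y)) / sxx
    b0 = my - b1 * mx
    resid = [b - b0 - b1 * a for a, b in zip(x, y)]
    # Homoskedasticity-only formula: s_u^2 / sum((X_i - Xbar)^2)
    s2 = sum(e * e for e in resid) / (n - 2)
    se_homo = math.sqrt(s2 / sxx)
    # Robust formula: (1/n) * [sum((X_i - Xbar)^2 * u_i^2) / (n - 2)] / [sum((X_i - Xbar)^2) / n]^2
    num = sum((d * e) ** 2 for d, e in zip(dx, resid)) / (n - 2)
    se_robust = math.sqrt(num / n / (sxx / n) ** 2)
    return b1, se_homo, se_robust

random.seed(4)
n = 5_000
x = [random.gauss(20, 2) for _ in range(n)]
# Assumed heteroskedasticity: error spread proportional to |x - 20|
y = [650 - 0.7 * a + random.gauss(0, 1) * 4 * abs(a - 20) for a in x]

b1, se_homo, se_robust = simple_ols_with_se(x, y)
print(f"slope = {b1:.2f}, homoskedasticity-only SE = {se_homo:.3f}, robust SE = {se_robust:.3f}")
```

In this design the homoskedasticity-only formula understates the sampling uncertainty, so confidence intervals built from it would be too narrow.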


Discussion and Implications 


The similarity between the Massachusetts and California results suggests that these stud- 
ies are externally valid in the sense that the main findings can be generalized to perfor- 
mance on standardized tests at other elementary school districts in the United States. 
Some of the most important potential threats to internal validity have been 
addressed by controlling for student background, family economic background, and 
district affluence and by checking for nonlinearities in the regression function. Still, 
some potential threats to internal validity remain. A leading candidate is omitted 
variable bias, perhaps arising because the control variables do not capture other 
characteristics of the school districts or extracurricular learning opportunities. 
Based on both the California and the Massachusetts data, we are able to answer
the superintendent’s question from Section 4.1: After controlling for family economic 
background, student characteristics, and district affluence and after modeling nonlin- 
earities in the regression function, cutting the student-teacher ratio by two students per 
teacher is predicted to increase test scores by approximately 0.08 standard deviations 
of the distribution of test scores across districts. This effect is statistically significant, but 
it is quite small. This small estimated effect is in line with the results of the many studies 
that have investigated the effects on test scores of class size reductions.

The superintendent can now use this estimate to help her decide whether to 
reduce class sizes. In making this decision, she will need to weigh the costs of the 
proposed reduction against the benefits. The costs include teacher salaries and 
expenses for additional classrooms. The benefits include improved academic perfor- 
mance, which we have measured by performance on standardized tests, but there are 
other potential benefits that we have not studied, including lower dropout rates and 
enhanced future earnings. The estimated effect of the proposal on standardized test 
performance is one important input into her calculation of costs and benefits. 


Conclusion 


The concepts of internal and external validity provide a framework for assessing 
what has been learned from an econometric study of causal effects. 

A study based on multiple regression is internally valid if the estimated coeffi- 
cients are unbiased and consistent and if standard errors are consistent. Threats to 
the internal validity of such a study include omitted variables, misspecification of 
functional form (nonlinearities), imprecise measurement of the independent vari- 
ables (errors in variables), sample selection, and simultaneous causality. Each of 
these introduces correlation between the regressor and the error term, which in turn 
makes OLS estimators biased and inconsistent. If the errors are correlated across 
observations, as they can be with time series data, or if they are heteroskedastic but 
the standard errors are computed using the homoskedasticity-only formula, then 
internal validity is compromised because the standard errors will be inconsistent. 
These latter problems can be addressed by computing the standard errors properly. 

A study using regression analysis, like any statistical study, is externally valid if 
its findings can be generalized beyond the population and setting studied. Sometimes 
it can help to compare two or more studies on the same topic. Whether or not there 
are two or more such studies, however, assessing external validity requires making 
judgments about the similarities of the population and setting studied and the popu- 
lation and setting to which the results are being generalized. 


If you are interested in learning more about the relationship between class size and test scores, see the 
reviews by Ehrenberg et al. (2001a, 2001b). 
The next two parts of this text develop ways to address threats to internal validity
that cannot be mitigated by multiple regression analysis alone. Part III extends the
multiple regression model in ways designed to mitigate all five sources of potential
bias in the OLS estimator. Part III also discusses a different approach to obtaining
internal validity, randomized controlled experiments, and it returns to the prediction 
problem when there are many predictors. Part IV develops methods for analyzing 
time series data and for using time series data to estimate so-called dynamic causal 
effects, which are causal effects that vary over time. 


Summary 


1. Statistical studies are evaluated by asking whether the analysis is internally and 
externally valid. A study is internally valid if the statistical inferences about 
causal effects are valid for the population being studied. A study is externally 
valid if its inferences and conclusions can be generalized from the population 
and setting studied to other populations and settings. 

2. In regression estimation of causal effects, there are two types of threats to 
internal validity. First, OLS estimators are biased and inconsistent if the regres- 
sors and error terms are correlated. Second, confidence intervals and hypoth- 
esis tests are not valid when the standard errors are incorrect. 

3. Regressors and error terms may be correlated when there are omitted vari- 
ables, an incorrect functional form is used, one or more of the regressors are 
measured with error, the sample is chosen nonrandomly from the popula- 
tion, or there is simultaneous causality between the regressors and dependent 
variables. 

4. Standard errors are incorrect when the errors are heteroskedastic and the com- 
puter software uses the homoskedasticity-only standard errors or when the 
error term is correlated across different observations. 

5. When regression models are used solely for prediction, it is not necessary for 
the regression coefficients to be unbiased estimates of causal effects. It is criti- 
cal, however, that the regression model be externally valid for the prediction 
application at hand. 


Key Terms 

population studied
population of interest
internal validity
external validity
functional form misspecification
errors-in-variables bias
classical measurement error model
sample selection bias
simultaneous causality
simultaneous equations bias


Review the Concepts 


9.1 Explain the difference between internal validity and external validity. Is it
possible for an econometric study to have internal validity but not external
validity?

9.2 Key Concept 9.2 describes the problem of variable selection in terms of a
trade-off between bias and variance. What is this trade-off? Why could including
an additional control variable decrease bias? Increase variance?

9.3 What is the effect of measurement error in Y? How is this different from the
effect of measurement error in X?

9.4 What is sample selection bias? Suppose you read a study using data on college
graduates of the effects of an additional year of schooling on earnings. What
is the potential sample selection bias present?

9.5 What is simultaneous causality bias? Explain the potential for simultaneous
causality in a study of the effects of high levels of bureaucratic corruption on
national income.

9.6 A researcher estimates a regression using two different software packages.
The first uses the homoskedasticity-only formula for standard errors. The
second uses the heteroskedasticity-robust formula. The standard errors are
very different. Which should the researcher use? Why?


Exercises 


9.1 


9.2 


Suppose that you have just read a careful statistical study of the effect of 
improved health of children on their test scores at school. Using data from a 
project in a West African district in 2000, the study concluded that students who 
received multivitamin supplements performed substantially better at school. 
Use the concept of external validity to determine if these results are likely to 
apply to India in 2000, the United Kingdom in 2000, and West Africa in 2015. 


Consider the one-variable regression model Y, = By + BX; + u;, and 
suppose it satisfies the least squares assumptions in Key Concept 4.3. Sup- 
pose Y; is measured with error, so the data are Y; = Y, + w;, where w;is the 


9.3 


9.4 


9.5 
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measurement error, which is i.i.d. and independent of Y; and X;. Consider the 
population regression Y; = By + BX; + v; where v; is the regression error, 
using the mismeasured dependent variable, Y;. 


a. Show that V; = lUi + Wi. 


b. Show that the regression Ý, = fp + BX; + v; satisfies the least squares 
assumptions in Key Concept 4.3. (Assume that w;is independent of Y; 
and X; for all values of i and j and has a finite fourth moment.) 


c. Are the OLS estimators consistent? 
d. Can confidence intervals be constructed in the usual way? 


e. Evaluate these statements: “Measurement error in the X’s is a serious 
problem. Measurement error in Y is not.” 


Labor economists studying the determinants of women’s earnings discovered 
a puzzling empirical result. Using randomly selected employed women, they 
regressed earnings on the women’s number of children and a set of control 
variables (age, education, occupation, and so forth). They found that women 
with more children had higher wages, controlling for these other factors. 
Explain how sample selection might be the cause of this result. (Hint: Notice 
that women who do not work outside the home are missing from the sample.) 
[This empirical puzzle motivated James Heckman’s research on sample selec- 
tion that led to his 2000 Nobel Prize in Economics. See Heckman (1974).] 


Using the regressions shown in columns (2) of Tables 8.3 and 9.3, and column 
(2) of Table 9.2, construct a table like Table 9.3 and compare the estimated 
effects of a 10 percentage point increase in the students eligible for free lunch 
on test scores in California and Massachusetts. 


The demand for a commodity is given by Q = By + B,P + u, where Q denotes 
quantity, P denotes price, and u denotes factors other than price that determine 
demand. Supply for the commodity is given by Q = yọ + yıP + v, where v 
denotes factors other than price that determine supply. Suppose u and v both 
have a mean of 0, have variances ø? and 02, and are mutually uncorrelated. 


a. Solve the two simultaneous equations to show how Q and P depend on 
u and v. 

b. Derive the means of P and Q. 

c. Derive the variance of P, the variance of Q, and the covariance between 
Q and P. 


d. A random sample of observations of (Q;, P;) is collected, and Q; is 
regressed on P, (That is, Q; is the regressand, and P, is the regressor.) 
Suppose the sample is very large. 


i. Use your answers to (b) and (c) to derive values of the regression 
coefficients. [Hint: Use Equations (4.7) and (4.8).] 

ii. A researcher uses the slope of this regression as an estimate of the
slope of the demand function ($\beta_1$). Is the estimated slope too large
or too small? (Hint: Remember that demand curves slope down and
supply curves slope up.)


9.6 Suppose that n = 50 i.i.d. observations for $(Y_i, X_i)$ yield the following
regression results:


$$\hat{Y} = \underset{(23.5)}{492} + \underset{(16.4)}{73.9}\,X, \quad SER = 13.4, \quad R^2 = 0.78.$$


Another researcher is interested in the same regression, but he makes an error 
when he enters the data into his regression program: He enters each obser- 
vation twice, so he has 100 observations (with observation 1 entered twice, 
observation 2 entered twice, and so forth). 


a. Using these 100 observations, what results will be produced by his 
regression program? (Hint: Write the “incorrect” values of the sample 
means, variances, and covariances of Y and X as functions of the 
“correct” values. Use these to determine the regression statistics.) 


$\hat{Y}$ = ____ + ____ $X$, $SER$ = ____, $R^2$ = ____.
     (____)   (____)


b. Which (if any) of the internal validity conditions are violated?


9.7 Are the following statements true or false? Explain your answer.


a. “An ordinary least squares regression of Y onto X will not be internally 
valid if Y is correlated with the error term.” 


b. “If the error term exhibits heteroskedasticity, then the estimates of X
will always be biased.”


9.8 Would the regression in Equation (4.9) in Chapter 4 be useful for predicting
test scores in a school district in Massachusetts? Why or why not? 


9.9 Consider the linear regression of TestScore on Income shown in Figure 8.2 and the
nonlinear regression in Equation (8.18). Would either of these regressions provide 
a reliable estimate of the causal effect of income on test scores? Would either of 
these regressions provide a reliable method for predicting test scores? Explain. 


9.10 Read the box “The Effect of Ageing on Healthcare Expenditures: A Red
Herring?” in Section 8.3. Discuss the internal and external validity as a causal
effect of the relationship between age and healthcare expenditures, consider- 
ing both models 1 and 3. 


9.11 Read the box “The Demand for Economics Journals” in Section 8.3. Discuss
the internal and external validity of the estimated effect of price per citation 
on subscriptions. 


9.12 Consider the one-variable regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$, and suppose
it satisfies the least squares assumptions in Key Concept 4.3. The regressor $X_i$
is missing, but data on a related variable, $Z_i$, are available, and the value of $X_i$
is estimated using $\tilde{X}_i = E(X_i \mid Z_i)$. Let $w_i = \tilde{X}_i - X_i$.


a. Show that $\tilde{X}_i$ is the minimum mean square error estimator of $X_i$ using $Z_i$.
That is, let $\hat{X}_i = g(Z_i)$ be some other guess of $X_i$ based on $Z_i$, and show
that $E[(\hat{X}_i - X_i)^2] \ge E[(\tilde{X}_i - X_i)^2]$. (Hint: Review Exercise 2.27.)

b. Show that $E(w_i \mid \tilde{X}_i) = 0$.

c. Suppose that $E(u_i \mid Z_i) = 0$ and that $\tilde{X}_i$ is used as the regressor in place
of $X_i$. Show that $\hat{\beta}_1$ is consistent. Is $\hat{\beta}_0$ consistent?


9.13 Assume that the regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$ satisfies the least
squares assumptions in Key Concept 4.3. You and a friend collect a random 
sample of 300 observations on Y and X. 


a. Your friend reports that he inadvertently scrambled the X observations
for 20% of the sample. For these scrambled observations, the value of X
does not correspond to $X_i$ for the $i$th observation; rather, it corresponds to
the value of X for some other observation. In the notation of Section 9.2,
the measured value of the regressor, $\tilde{X}_i$, is equal to $X_i$ for 80% of the
observations, but it is equal to a randomly selected $X_j$ for the remaining
20% of the observations. You regress $Y_i$ on $\tilde{X}_i$. Show that $E(\hat{\beta}_1) = 0.8\beta_1$.


b. Explain how you could construct an unbiased estimate of $\beta_1$ using the
OLS estimator in (a). 


c. Suppose now your friend tells you that the X’s were scrambled for the 
first 60 observations but that the remaining 240 observations are correct. 
You estimate $\beta_1$ by regressing Y on X, using only the correctly measured
240 observations. Is this estimator of $\beta_1$ better than the estimator you
proposed in (b)? Explain. 


Empirical Exercises 


E9.1 Use the data set CPS2015, described in Empirical Exercise 8.2, to answer the 
following questions. 


a. Discuss the internal validity of the regressions that you used to answer 
Empirical Exercise 8.2(1). Include a discussion of possible omitted vari- 
able bias, misspecification of the functional form of the regression, errors 
in variables, sample selection, simultaneous causality, and inconsistency 
of the OLS standard errors. 

b. The data set CPS96_15 described in Empirical Exercise 3.1 includes data from 
1996 and 2015. Use these data to investigate the (temporal) external valid- 
ity of the conclusions that you reached in Empirical Exercise 8.2(1). [Note: 
Remember to adjust for inflation, as explained in Empirical Exercise 3.1(b).] 
E9.2 Use the data set Birthweight_Smoking introduced in Empirical Exercise 5.3
to answer the following questions.


a. In Empirical Exercise 7.1(f), you estimated several regressions and were
asked: “What is a reasonable 95% confidence interval for the effect of
smoking on birth weight?”


i. In Chapter 8, you learned about nonlinear regressions. Can you
think of any nonlinear regressions that can potentially improve your
answer to Empirical Exercise 7.1(f)? After estimating these additional
regressions, what is a reasonable 95% confidence interval for
the effect of smoking on birth weight?


ii. Discuss the internal validity of the regressions you used to construct
the confidence interval. Include a discussion of possible omitted
variable bias, misspecification of the functional form of the regression,
errors in variables, sample selection, simultaneous causality, and
inconsistency of the OLS standard errors.


b. The data set Birthweight_Smoking includes babies born in Pennsylvania
in 1989. Discuss the external validity of your analysis for (i) California in
1989, (ii) Illinois in 2019, and (iii) South Korea in 2019.


APPENDIX 9.1 The Massachusetts Elementary School Testing Data


The Massachusetts data are district-wide averages for public elementary school districts in 
1998. The test score is taken from the Massachusetts Comprehensive Assessment System 
(MCAS) test administered to all fourth graders in Massachusetts public schools in the spring 
of 1998. The test is sponsored by the Massachusetts Department of Education and is manda- 
tory for all public schools. The data analyzed here are the overall total score, which is the sum 
of the scores on the English, math, and science portions of the test. 

Data on the student-teacher ratio, the percentage of students receiving a subsidized lunch,
and the percentage of students still learning English are averages for each elementary school
district for the 1997-1998 school year and were obtained from the Massachusetts Department
of Education. Data on average district income were obtained from the 1990 U.S. Census.


CHAPTER 10 Regression with Panel Data


Multiple regression is a powerful tool for controlling for the effect of variables on
which we have data. If data are not available for some of the variables, however,
they cannot be included in the regression, and the OLS estimators of the regression
coefficients could have omitted variable bias.

This chapter describes a method for controlling for some types of omitted 
variables without actually observing them. This method requires a specific type of 
data, called panel data, in which each observational unit, or entity, is observed at 
two or more time periods. By studying changes in the dependent variable over 
time, it is possible to eliminate the effect of omitted variables that differ across 
entities but are constant over time. 

The empirical application in this chapter concerns drunk driving: What are the 
effects of alcohol taxes and drunk driving laws on traffic fatalities? We address this 
question using data on traffic fatalities, alcohol taxes, drunk driving laws, and 
related variables for the 48 contiguous U.S. states for each of the seven years from 
1982 to 1988. This panel data set lets us control for unobserved variables that differ 
from one state to the next, such as prevailing cultural attitudes toward drinking and 
driving, but do not change over time. It also allows us to control for variables that 
vary through time, like improvements in the safety of new cars, but do not vary 
across states. 

Section 10.1 describes the structure of panel data and introduces the drunk 
driving data set. Fixed effects regression, the main tool for regression analysis of 
panel data, is an extension of multiple regression that exploits panel data to 
control for variables that differ across entities but are constant over time. Fixed 
effects regression is introduced in Sections 10.2 and 10.3, first for the case of only 
two time periods and then for multiple time periods. In Section 10.4, these 
methods are extended to incorporate so-called time fixed effects, which control 
for unobserved variables that are constant across entities but change over time. 
Section 10.5 discusses the panel data regression assumptions and standard errors 
for panel data regression. In Section 10.6, we use these methods to study the 
effect of alcohol taxes and drunk driving laws on traffic deaths. 


Key Concept 10.1: Notation for Panel Data


Panel data consist of observations on the same n entities at two or more time
periods T, as is illustrated in Table 1.3. If the data set contains observations on the
variables X and Y, then the data are denoted


$$(X_{it}, Y_{it}), \quad i = 1, \ldots, n \ \text{and} \ t = 1, \ldots, T, \tag{10.1}$$


where the first subscript, i, refers to the entity being observed and the second
subscript, t, refers to the date at which it is observed.


10.1 Panel Data


Recall from Section 1.3 that panel data (also called longitudinal data) refers to data 
for n different entities observed at T different time periods. The state traffic fatality 
data studied in this chapter are panel data. Those data are for n = 48 entities (states), 
where each entity is observed in T = 7 time periods (each of the years 1982, ...,
1988), for a total of 7 × 48 = 336 observations.

When describing cross-sectional data, it was useful to use a subscript to denote
the entity; for example, $Y_i$ referred to the variable Y for the $i$th entity. When describing
panel data, we need some additional notation to keep track of both the entity and
the time period. We do so by using two subscripts rather than one: The first, i, refers
to the entity, and the second, t, refers to the time period of the observation. Thus $Y_{it}$
denotes the variable Y observed for the $i$th of n entities in the $t$th of T periods. This
notation is summarized in Key Concept 10.1.

Some additional terminology associated with panel data describes whether some 
observations are missing. A balanced panel has all its observations; that is, the vari- 
ables are observed for each entity and each time period. A panel that has some miss- 
ing data for at least one time period for at least one entity is called an unbalanced 
panel. The traffic fatality data set has data for all 48 contiguous U.S. states for all 
seven years, so it is balanced. If, however, some data were missing (for example, if we 
did not have data on fatalities for some states in 1983), then the data set would be 
unbalanced. The methods presented in this chapter are described for a balanced 
panel; however, all these methods can be used with an unbalanced panel, although 
precisely how to do so in practice depends on the regression software being used. 
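
The balanced/unbalanced distinction can be checked mechanically. The following is a minimal sketch, not a routine from any particular regression package: it assumes each observation is keyed by an (entity, period) pair, and the state labels and the missing 1983 observation are made up purely for illustration.

```python
from itertools import product


def is_balanced(observations):
    """Return True if every (entity, period) pair appears exactly once.

    `observations` is an iterable of (entity, period) keys, one per row.
    """
    keys = list(observations)
    entities = {e for e, _ in keys}
    periods = {t for _, t in keys}
    return len(keys) == len(set(keys)) == len(entities) * len(periods)


# Hypothetical keys mirroring the traffic fatality panel: 48 states x 7 years.
states = [f"state_{i:02d}" for i in range(1, 49)]
years = range(1982, 1989)
full_panel = list(product(states, years))

# Dropping one state-year (a made-up gap in 1983) makes the panel unbalanced.
gappy_panel = [k for k in full_panel if k != ("state_05", 1983)]

print(len(full_panel), is_balanced(full_panel))    # 336 True
print(len(gappy_panel), is_balanced(gappy_panel))  # 335 False
```

The check compares the number of rows with n × T, so it flags both missing and duplicated entity-period pairs.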


Example: Traffic Deaths and Alcohol Taxes 


There are approximately 40,000 highway traffic fatalities each year in the United 
States. Approximately one-fourth of fatal crashes involve a driver who was drinking, 
and this fraction rises during peak drinking periods. One study (Levitt and Porter, 
2001) estimates that as many as 25% of drivers on the road between 1 a.m. and 3 a.m.
have been drinking and that a driver who is legally drunk is at least 13 times as likely 
to cause a fatal crash as a driver who has not been drinking. 

In this chapter, we study how effective various government policies designed to 
discourage drunk driving actually are in reducing traffic deaths. The panel data set 
contains variables related to traffic fatalities and alcohol, including the number of 
traffic fatalities in each state in each year, the type of drunk driving laws in each state 
in each year, and the tax on beer in each state. The measure of traffic deaths we use 
is the fatality rate, which is the number of annual traffic deaths per 10,000 people in 
the population in the state. The measure of alcohol taxes we use is the “real” tax on 
a case of beer, which is the beer tax, put into 1988 dollars by adjusting for inflation.¹
The data are described in more detail in Appendix 10.1. 

Figure 10.1a is a scatterplot of the data for 1982 on two of these variables, the 
fatality rate and the real tax on a case of beer. A point in this scatterplot represents 
the fatality rate in 1982 and the real beer tax in 1982 for a given state. The OLS 
regression line obtained by regressing the fatality rate on the real beer tax is also 
plotted in the figure; the estimated regression line is 


$$\widehat{\mathit{FatalityRate}} = \underset{(0.15)}{2.01} + \underset{(0.13)}{0.15}\,\mathit{BeerTax} \quad \text{(1982 data)}. \tag{10.2}$$


The coefficient on the real beer tax is positive but not statistically significant at the 
10% level. 

Because we have data for more than one year, we can reexamine this relation- 
ship for another year. This is done in Figure 10.1b, which is the same scatterplot as 
before except that it uses the data for 1988. The OLS regression line through these 
data is 


$$\widehat{\mathit{FatalityRate}} = \underset{(0.11)}{1.86} + \underset{(0.13)}{0.44}\,\mathit{BeerTax} \quad \text{(1988 data)}. \tag{10.3}$$


In contrast to the regression using the 1982 data, the coefficient on the real beer 
tax is statistically significant at the 1% level (the t-statistic is 3.43). Curiously, the 
estimated coefficients for the 1982 and the 1988 data are positive: Taken literally, 
higher real beer taxes are associated with more, not fewer, traffic fatalities. 

Should we conclude that an increase in the tax on beer leads to more traffic 
deaths? Not necessarily, because these regressions could have substantial omitted 
variable bias. Many factors affect the fatality rate, including the quality of the 
automobiles driven in the state, whether the state highways are in good repair, 
whether most driving is rural or urban, the density of cars on the road, and whether 
it is socially acceptable to drink and drive. Any of these factors may be correlated
with alcohol taxes, and if so, this will lead to omitted variable bias. One approach
to these potential sources of omitted variable bias would be to collect data on all
these variables and add them to the annual cross-sectional regressions in Equations
(10.2) and (10.3). Unfortunately, some of these variables, such as the cultural
acceptance of drinking and driving, might be very hard or even impossible to
measure.


¹To make the taxes comparable over time, they are put into 1988 dollars using the Consumer Price Index
(CPI). For example, because of inflation, a tax of $1 in 1982 corresponds to a tax of $1.23 in 1988 dollars.


FIGURE 10.1 The Traffic Fatality Rate and the Tax on Beer. Panel (a) is a scatterplot of
traffic fatality rates and the real tax on a case of beer (in 1988 dollars) for 48 states in 1982;
panel (b) shows the 1988 data.


If these factors remain constant over time in a given state, however, then another
route is available. Because we have panel data, we can, in effect, hold these factors
constant even though we cannot measure them. To do so, we use OLS regression with
fixed effects.


10.2 Panel Data with Two Time Periods: “Before and After” Comparisons


When data for each state are obtained for T = 2 time periods, it is possible to com- 
pare values of the dependent variable in the second period to values in the first 
period. By focusing on changes in the dependent variable, this “before and after” or 
“differences” comparison, in effect, holds constant the unobserved factors that differ 
from one state to the next but do not change over time within the state. 

Let $Z_i$ be a variable that determines the fatality rate in the $i$th state but does not
change over time (so the t subscript is omitted). For example, $Z_i$ might be the local
cultural attitude toward drinking and driving, which changes slowly and thus could
be considered to be constant between 1982 and 1988. Accordingly, the population
linear regression relating $Z_i$ and the real beer tax to the fatality rate is


$$\mathit{FatalityRate}_{it} = \beta_0 + \beta_1 \mathit{BeerTax}_{it} + \beta_2 Z_i + u_{it}, \tag{10.4}$$


where $u_{it}$ is the error term, $i = 1, \ldots, n$, and $t = 1, \ldots, T$.

Because $Z_i$ does not change over time, in the regression model in Equation (10.4)
it will not produce any change in the fatality rate between 1982 and 1988. Thus, in this
regression model, the influence of $Z_i$ can be eliminated by analyzing the change in
the fatality rate between the two periods. To see this mathematically, consider
Equation (10.4) for each of the two years 1982 and 1988:


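$$\mathit{FatalityRate}_{i1982} = \beta_0 + \beta_1 \mathit{BeerTax}_{i1982} + \beta_2 Z_i + u_{i1982}, \tag{10.5}$$
$$\mathit{FatalityRate}_{i1988} = \beta_0 + \beta_1 \mathit{BeerTax}_{i1988} + \beta_2 Z_i + u_{i1988}. \tag{10.6}$$


Subtracting Equation (10.5) from Equation (10.6) eliminates the effect of $Z_i$:


$$\mathit{FatalityRate}_{i1988} - \mathit{FatalityRate}_{i1982} = \beta_1(\mathit{BeerTax}_{i1988} - \mathit{BeerTax}_{i1982}) + u_{i1988} - u_{i1982}. \tag{10.7}$$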


This specification has an intuitive interpretation. Cultural attitudes toward drinking 
and driving affect the level of drunk driving and thus the traffic fatality rate in a state. 
If, however, they did not change between 1982 and 1988, then they did not produce 
any change in fatalities in the state. Rather, any changes in traffic fatalities over time 
must have arisen from other sources. In Equation (10.7), these other sources are 
changes in the tax on beer and changes in the error term (which captures changes in
other factors that determine traffic deaths). 

Specifying the regression in changes in Equation (10.7) eliminates the effect of 
the unobserved variables $Z_i$ that are constant over time. In other words, analyzing
changes in Y and X has the effect of controlling for variables that are constant over 
time, thereby eliminating this source of omitted variable bias. 
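
The logic of Equation (10.7) can be illustrated with simulated data. The sketch below is not based on the traffic fatality data set: the sample size, the coefficient values, and the strength of the correlation between X and Z are all assumptions chosen for illustration. It compares a single-period cross-sectional regression that omits Z with the regression in changes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                      # number of entities (hypothetical)
beta1, beta2 = -1.0, 2.0      # true coefficients (chosen for illustration)

z = rng.normal(size=n)                        # unobserved, time-invariant Z_i
x1 = 0.8 * z + rng.normal(size=n)             # X in period 1, correlated with Z
x2 = 0.8 * z + rng.normal(size=n)             # X in period 2, correlated with Z
y1 = 1.0 + beta1 * x1 + beta2 * z + rng.normal(size=n)
y2 = 1.0 + beta1 * x2 + beta2 * z + rng.normal(size=n)


def ols_slope(x, y):
    """Slope from an OLS regression of y on x with an intercept."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


cross_section = ols_slope(x2, y2)            # omits Z, so the slope is biased
differences = ols_slope(x2 - x1, y2 - y1)    # Z is differenced out

print(f"cross-section slope: {cross_section:.2f}")
print(f"differences slope:   {differences:.2f}")
```

Under these assumptions the cross-sectional slope is pulled far away from the true value of -1 by the omitted Z, while the slope from the regression in changes should be close to -1, because Z drops out of the differenced equation.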

Figure 10.2 presents a scatterplot of the change in the fatality rate between 1982 
and 1988 against the change in the real beer tax between 1982 and 1988 for the 48 
states in our data set. A point in Figure 10.2 represents the change in the fatality rate 
and the change in the real beer tax between 1982 and 1988 for a given state. The OLS 
regression line, estimated using these data and plotted in the figure, is 

$$\widehat{\mathit{FatalityRate}_{1988} - \mathit{FatalityRate}_{1982}} = \underset{(0.065)}{-0.072} - \underset{(0.36)}{1.04}\,(\mathit{BeerTax}_{1988} - \mathit{BeerTax}_{1982}). \tag{10.8}$$


Including an intercept in Equation (10.8) allows for the possibility that the mean 
change in the fatality rate, in the absence of a change in the real beer tax, is nonzero. 
For example, the negative intercept (−0.072) could reflect improvements in auto
safety between 1982 and 1988 that reduced the average fatality rate. 

In contrast to the cross-sectional regression results, the estimated effect of a change 
in the real beer tax is negative, as predicted by economic theory. The hypothesis that 
the population slope coefficient is 0 is rejected at the 5% significance level. According 
to this estimated coefficient, an increase in the real beer tax by $1 per case reduces the 
traffic fatality rate by 1.04 deaths per 10,000 people. This estimated effect is very large:
The average fatality rate is approximately 2 in these data (that is, 2 fatalities per year
per 10,000 members of the population), so the estimate suggests that traffic fatalities
can be cut in half merely by increasing the real tax on beer by $1 per case.


FIGURE 10.2 Changes in Fatality Rates and Beer Taxes from 1982 to 1988

By examining changes in the fatality rate over time, the regression in Equation 
(10.8) controls for fixed factors such as cultural attitudes toward drinking and driving. 
But there are many factors that influence traffic safety, and if they change over time 
and are correlated with the real beer tax, then their omission will produce omitted 
variable bias. In Section 10.6, we undertake a more careful analysis that controls for 
several such factors, so for now it is best to refrain from drawing any substantive 
conclusions about the effect of real beer taxes on traffic fatalities. 

This “before and after” or “differences” analysis works when the data are observed 
in two different years. Our data set, however, contains observations for seven different 
years, and it seems foolish to discard those potentially useful additional data. But the 
“before and after” method does not apply directly when T > 2. To analyze all the 
observations in our panel data set, we use the method of fixed effects regression. 


10.3 Fixed Effects Regression


Fixed effects regression is a method for controlling for omitted variables in panel 
data when the omitted variables vary across entities (states) but do not change over 
time. Unlike the “before and after” comparisons of Section 10.2, fixed effects regres- 
sion can be used when there are two or more time observations for each entity. 

The fixed effects regression model has n different intercepts, one for each entity. 
These intercepts can be represented by a set of binary (or indicator) variables. These 
binary variables absorb the influences of all omitted variables that differ from one 
entity to the next but are constant over time. 


The Fixed Effects Regression Model 


Consider the regression model in Equation (10.4) with the dependent variable
(FatalityRate) and observed regressor (BeerTax) denoted as $Y_{it}$ and $X_{it}$, respectively:


$$Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_2 Z_i + u_{it}, \tag{10.9}$$


where $Z_i$ is an unobserved variable that varies from one state to the next but does
not change over time (for example, $Z_i$ represents cultural attitudes toward drinking
and driving). We want to estimate $\beta_1$, the effect on Y of X, holding constant the
unobserved state characteristics Z.

Because $Z_i$ varies from one state to the next but is constant over time, the population
regression model in Equation (10.9) can be interpreted as having n intercepts,
one for each state. Specifically, let $\alpha_i = \beta_0 + \beta_2 Z_i$. Then Equation (10.9) becomes


$$Y_{it} = \beta_1 X_{it} + \alpha_i + u_{it}. \tag{10.10}$$


Equation (10.10) is the fixed effects regression model, in which $\alpha_1, \ldots, \alpha_n$ are treated
as unknown intercepts to be estimated, one for each state. The interpretation of $\alpha_i$ as
a state-specific intercept in Equation (10.10) comes from considering the population
regression line for the $i$th state; this population regression line is $\alpha_i + \beta_1 X_{it}$. The slope
coefficient of the population regression line, $\beta_1$, is the same for all states, but the
intercept of the population regression line varies from one state to the next.

Because the intercept $\alpha_i$ in Equation (10.10) can be thought of as the “effect” of
being in entity i (in the current application, entities are states), the terms $\alpha_1, \ldots, \alpha_n$
are known as entity fixed effects. The variation in the entity fixed effects comes from
omitted variables that, like $Z_i$ in Equation (10.9), vary across entities but not over
time.

The state-specific intercepts in the fixed effects regression model also can be 
expressed using binary variables to denote the individual states. Section 8.3 consid- 
ered the case in which the observations belong to one of two groups and the popula- 
tion regression line has the same slope for both groups but different intercepts (see 
Figure 8.8a). That population regression line was expressed mathematically using a 
single binary variable indicating one of the groups (case 1 in Key Concept 8.4). If we 
had only two states in our data set, that binary variable regression model would apply 
here. Because we have more than two states, however, we need additional binary 
variables to capture all the state-specific intercepts in Equation (10.10). 

To develop the fixed effects regression model using binary variables, let $D1_i$ be
a binary variable that equals 1 when i = 1 and equals 0 otherwise, let $D2_i$ equal 1
when i = 2 and equal 0 otherwise, and so on. We cannot include all n binary variables
plus a common intercept, for if we do, the regressors will be perfectly multicollinear
(this is the dummy variable trap of Section 6.7), so we arbitrarily omit the binary
variable $D1_i$ for the first entity. Accordingly, the fixed effects regression model in
Equation (10.10) can be written equivalently as


$$Y_{it} = \beta_0 + \beta_1 X_{it} + \gamma_2 D2_i + \gamma_3 D3_i + \cdots + \gamma_n Dn_i + u_{it}, \tag{10.11}$$


where $\beta_0, \beta_1, \gamma_2, \ldots, \gamma_n$ are unknown coefficients to be estimated. To derive the
relationship between the coefficients in Equation (10.11) and the intercepts in
Equation (10.10), compare the population regression lines for each state in the two
equations. In Equation (10.11), the population regression equation for the first state
is $\beta_0 + \beta_1 X_{it}$, so $\alpha_1 = \beta_0$. For the second and remaining states, it is $\beta_0 + \beta_1 X_{it} + \gamma_i$,
so $\alpha_i = \beta_0 + \gamma_i$ for $i \ge 2$.

Thus there are two equivalent ways to write the fixed effects regression model,
Equations (10.10) and (10.11). In Equation (10.10), it is written in terms of n
state-specific intercepts. In Equation (10.11), the fixed effects regression model has a
common intercept and n − 1 binary regressors. In both formulations, the slope coefficient
on X is the same from one state to the next. The state-specific intercepts in Equation
(10.10) and the binary regressors in Equation (10.11) have the same source: the
unobserved variable $Z_i$ that varies across states but not over time.
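
The mapping between the two formulations can be checked numerically. The following sketch uses a small simulated panel; the number of entities, the intercept values, and the slope are assumptions made up for illustration. It estimates the binary variable specification by OLS, recovers the implied entity intercepts as the common intercept plus the coefficient on each entity's binary variable, and then shows why adding the omitted first binary variable back in leads to perfect multicollinearity.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 4, 50                                   # small hypothetical panel
alpha = np.array([1.0, 3.0, -2.0, 0.5])        # true entity intercepts (made up)
beta1 = -0.7                                   # true slope (made up)

entity = np.repeat(np.arange(n), T)            # entity index for each row
x = rng.normal(size=n * T) + alpha[entity]     # X correlated with the intercepts
y = beta1 * x + alpha[entity] + 0.1 * rng.normal(size=n * T)

# Design matrix for the binary variable form: intercept, X, and D2, ..., Dn.
dummies = (entity[:, None] == np.arange(1, n)).astype(float)
design = np.column_stack([np.ones(n * T), x, dummies])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

beta0_hat, beta1_hat, gamma_hat = coef[0], coef[1], coef[2:]
# alpha_1 = beta_0, and alpha_i = beta_0 + gamma_i for i >= 2.
alpha_hat = beta0_hat + np.concatenate([[0.0], gamma_hat])

print("beta1 estimate:", round(float(beta1_hat), 2))
print("implied intercepts:", np.round(alpha_hat, 1))

# Dummy variable trap: adding D1 as well makes the columns linearly dependent.
trap = np.column_stack([design, (entity == 0).astype(float)])
print("columns:", trap.shape[1], "rank:", np.linalg.matrix_rank(trap))
```

In the last step the matrix has one more column than its rank, which is the perfect multicollinearity that omitting the first binary variable avoids.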
Key Concept 10.2: The Fixed Effects Regression Model


The fixed effects regression model is

$$Y_{it} = \beta_1 X_{1,it} + \cdots + \beta_k X_{k,it} + \alpha_i + u_{it}, \tag{10.12}$$


where $i = 1, \ldots, n$; $t = 1, \ldots, T$; $X_{1,it}$ is the value of the first regressor for entity i
in time period t, $X_{2,it}$ is the value of the second regressor, and so forth; and
$\alpha_1, \ldots, \alpha_n$ are entity-specific intercepts.

Equivalently, the fixed effects regression model can be written in terms of a common
intercept, the X’s, and n − 1 binary variables representing all but one entity:


$$Y_{it} = \beta_0 + \beta_1 X_{1,it} + \cdots + \beta_k X_{k,it} + \gamma_2 D2_i + \gamma_3 D3_i + \cdots + \gamma_n Dn_i + u_{it}, \tag{10.13}$$


where $D2_i = 1$ if i = 2 and $D2_i = 0$ otherwise, and so forth.


Extension to multiple X’s. If there are other observed determinants of Y that are 
correlated with X and that change over time, then these should also be included in 
the regression to avoid omitted variable bias. Doing so results in the fixed effects 
regression model with multiple regressors, summarized in Key Concept 10.2. 


Estimation and Inference 


In principle, the binary variable specification of the fixed effects regression model 
[Equation (10.13)] can be estimated by OLS. This regression, however, has k + n 
regressors (the k X’s, the n − 1 binary variables, and the intercept), so in practice this
OLS regression is tedious or, in some software packages, impossible to implement if 
the number of entities is large. Econometric software therefore has special routines 
for OLS estimation of fixed effects regression models. These special routines are 
equivalent to using OLS on the full binary variable regression, but they are faster 
because they employ some mathematical simplifications that arise in the algebra of 
fixed effects regression. 


The “entity-demeaned” OLS algorithm. Regression software typically computes 
the OLS fixed effects estimator in two steps. In the first step, the entity-specific 
average is subtracted from each variable. In the second step, the regression is 
estimated using “entity-demeaned” variables. Specifically, consider the case of a 
single regressor in the version of the fixed effects model in Equation (10.10), and take
the average of both sides of Equation (10.10); then $\bar{Y}_i = \beta_1 \bar{X}_i + \alpha_i + \bar{u}_i$, where
$\bar{Y}_i = (1/T)\sum_{t=1}^{T} Y_{it}$, and $\bar{X}_i$ and $\bar{u}_i$ are defined similarly. Thus Equation (10.10)
implies that $Y_{it} - \bar{Y}_i = \beta_1 (X_{it} - \bar{X}_i) + (u_{it} - \bar{u}_i)$. Let $\tilde{Y}_{it} = Y_{it} - \bar{Y}_i$, $\tilde{X}_{it} = X_{it} - \bar{X}_i$,
and $\tilde{u}_{it} = u_{it} - \bar{u}_i$; accordingly,


$$\tilde{Y}_{it} = \beta_1 \tilde{X}_{it} + \tilde{u}_{it}. \tag{10.14}$$


Thus $\beta_1$ can be estimated by the OLS regression of the “entity-demeaned” variables
$\tilde{Y}_{it}$ on $\tilde{X}_{it}$. In fact, this estimator is identical to the OLS estimator of $\beta_1$ obtained by
estimation of the fixed effects model in Equation (10.11) using n − 1 binary variables
(Exercise 19.6).

The “before and after” (differences) regression versus the binary variables specifi- 
cation. Although Equation (10.11) with its binary variables looks quite different 
from the “before and after” regression model in Equation (10.7), in the special case 
that T = 2 the OLS estimator of $\beta_1$ from the binary variable specification and that
from the “before and after” specification are identical if the intercept is excluded
from the “before and after” specification. Thus, when T = 2, there are three ways to
estimate $\beta_1$ by OLS: the “before and after” specification in Equation (10.7) (without
an intercept), the binary variable specification in Equation (10.11), and the
entity-demeaned specification in Equation (10.14). These three methods are equivalent;
that is, they produce identical OLS estimates of $\beta_1$ (Exercise 10.11).
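
For T = 2 the equivalence is easy to verify numerically. The sketch below again relies on simulated data with made-up parameter values; it compares the “before and after” estimator without an intercept with the entity-demeaned estimator computed from the same two periods.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 48                                         # entities, two periods each
alpha = rng.normal(size=n)                     # hypothetical state effects
x = alpha[:, None] + rng.normal(size=(n, 2))   # columns: period 1, period 2
y = -1.0 * x + alpha[:, None] + rng.normal(size=(n, 2))

# "Before and after": change in Y regressed on change in X, no intercept.
dx = x[:, 1] - x[:, 0]
dy = y[:, 1] - y[:, 0]
beta1_diff = (dx @ dy) / (dx @ dx)

# Entity-demeaned regression using both periods.
x_tilde = x - x.mean(axis=1, keepdims=True)
y_tilde = y - y.mean(axis=1, keepdims=True)
beta1_within = (x_tilde * y_tilde).sum() / (x_tilde ** 2).sum()

print(beta1_diff, beta1_within, np.isclose(beta1_diff, beta1_within))
```

With two periods, each demeaned value is plus or minus half of the change between the periods, so the factors of one-half cancel in the ratio and the two estimators coincide.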


The sampling distribution, standard errors, and statistical inference. In multiple 
regression with cross-sectional data, if the four least squares assumptions in Key Con- 
cept 6.4 hold, then the sampling distribution of the OLS estimator is normal in large 
samples. The variance of this sampling distribution can be estimated from the data, and 
the square root of this estimator of the variance— that is, the standard error—can be 
used to test hypotheses using a t-statistic and to construct confidence intervals. 

Similarly, in multiple regression with panel data, if a set of assumptions — called 
the fixed effects regression assumptions — holds, then the sampling distribution of the 
fixed effects OLS estimator is normal in large samples, the variance of that distribu- 
tion can be estimated from the data, the square root of that estimator is the standard 
error, and the standard error can be used to construct t-statistics and confidence
intervals. Given the standard error, statistical inference — testing hypotheses (includ- 
ing joint hypotheses using F-statistics) and constructing confidence intervals—pro- 
ceeds in exactly the same way as in multiple regression with cross-sectional data. 

The fixed effects regression assumptions and standard errors for fixed effects 
regression are discussed further in Section 10.5. 


Application to Traffic Deaths 


The OLS estimate of the fixed effects regression line relating the real beer tax to the 
fatality rate, based on all 7 years of data (336 observations), is 


$$\widehat{\mathit{FatalityRate}} = -0.66\,\mathit{BeerTax} + \text{state fixed effects}, \tag{10.15}$$


where, as is conventional, the estimated state fixed intercepts are not listed to save
space and because they are not of primary interest in this application. 

Like the “before and after” specification in Equation (10.8), the estimated coef- 
ficient in the fixed effects regression in Equation (10.15) is negative, so, as predicted 
by economic theory, higher real beer taxes are associated with fewer traffic deaths, 
which is the opposite of what we found in the initial cross-sectional regressions of 
Equations (10.2) and (10.3). The two regressions are not identical because the “before 
and after” regression in Equation (10.8) uses only the data for 1982 and 1988 (specifi- 
cally, the difference between those two years), whereas the fixed effects regression in 
Equation (10.15) uses the data for all 7 years. Because of the additional observations, 
the standard error is smaller in Equation (10.15) than in Equation (10.8). 

Including state fixed effects in the fatality rate regression lets us avoid omitted 
variables bias arising from omitted factors, such as cultural attitudes toward drinking 
and driving, that vary across states but are constant over time within a state. Still, a 
skeptic might suspect that other factors could lead to omitted variables bias. For 
example, over this period cars were getting safer, and occupants were increasingly 
wearing seat belts; if the real tax on beer rose, on average, during the mid-1980s, then 
BeerTax could be picking up the effect of overall automobile safety improvements. 
If, however, safety improvements evolved over time but were the same for all states, 
then we can eliminate their influence by including time fixed effects. 


Regression with Time Fixed Effects 


Just as fixed effects for each entity can control for variables that are constant over 
time but differ across entities, so time fixed effects can control for variables that are 
constant across entities but evolve over time. 

Because safety improvements in new cars are introduced nationally, they serve 
to reduce traffic fatalities in all states. So it is plausible to think of automobile safety 
as an omitted variable that changes over time but has the same value for all states. 
The population regression in Equation (10.9) can be modified to make explicit the 
effect of automobile safety, which we will denote $S_t$: 


$Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_2 Z_i + \beta_3 S_t + u_{it}$,   (10.16) 


where $S_t$ is unobserved and where the single t subscript emphasizes that safety 
changes over time but is constant across states. Because $\beta_3 S_t$ represents variables that 
determine $Y_{it}$, if $S_t$ is correlated with $X_{it}$, then omitting $S_t$ from the regression leads to 
omitted variable bias. 


Time Effects Only 


For the moment, suppose that the variables $Z_i$ are not present, so that the term $\beta_2 Z_i$ 
can be dropped from Equation (10.16), although the term $\beta_3 S_t$ remains. Our objective 
is to estimate $\beta_1$, controlling for $S_t$. 

Although $S_t$ is unobserved, its influence can be eliminated because it varies over time 
but not across states, just as it is possible to eliminate the effect of $Z_i$, which varies across 
states but not over time. In the entity fixed effects model, the presence of $Z_i$ leads to the 
fixed effects regression model in Equation (10.10), in which each state has its own intercept 
(or fixed effect). Similarly, because $S_t$ varies over time but not over states, the presence 
of $S_t$ leads to a regression model in which each time period has its own intercept. 

The time fixed effects regression model with a single X regressor is 


$Y_{it} = \beta_1 X_{it} + \lambda_t + u_{it}$.   (10.17) 


This model has a different intercept, $\lambda_t$, for each time period. The intercept $\lambda_t$ in 
Equation (10.17) can be thought of as the “effect” on Y of year t (or, more generally, 
time period t), so the terms $\lambda_1, \ldots, \lambda_T$ are known as time fixed effects. The variation 
in the time fixed effects comes from omitted variables that, like $S_t$ in Equation (10.16), 
vary over time but not across entities. 
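
One way to see where the time-specific intercept comes from is to drop the $\beta_2 Z_i$ term from Equation (10.16) and collect the terms that depend only on t. The definition of $\lambda_t$ below is written by analogy with the entity fixed effect; it is a sketch of the algebra, not a quotation of the text:

$$
\begin{aligned}
Y_{it} &= \beta_0 + \beta_1 X_{it} + \beta_3 S_t + u_{it} \\
       &= \beta_1 X_{it} + \lambda_t + u_{it}, \qquad \text{where } \lambda_t = \beta_0 + \beta_3 S_t .
\end{aligned}
$$

Because $S_t$ takes the same value for every entity in period t, it is absorbed entirely into $\lambda_t$ and does not need to be observed.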

Just as the entity fixed effects regression model can be represented using n − 1 
binary indicators, so, too, can the time fixed effects regression model be represented 
using T − 1 binary indicators: 


$Y_{it} = \beta_0 + \beta_1 X_{it} + \delta_2 B2_t + \cdots + \delta_T BT_t + u_{it}$,   (10.18) 


where $\delta_2, \ldots, \delta_T$ are unknown coefficients and where $B2_t = 1$ if t = 2 and $B2_t = 0$ 
otherwise, and so forth. As in the fixed effects regression model in Equation (10.11), 
in this version of the time effects model the intercept is included, and the first binary 
variable ($B1_t$) is omitted to prevent perfect multicollinearity. 

When there are additional observed “X” regressors, then these regressors appear 
in Equations (10.17) and (10.18) as well. 
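
The two representations of the time fixed effects model can be checked numerically. The sketch below uses only the Python standard library and a simulated balanced panel; the panel dimensions, the true slope of 2.0, and the helper names `solve` and `ols` are illustrative assumptions, not part of the text. It estimates the slope once with an intercept and T − 1 time indicators, as in Equation (10.18), and once by deviating X and Y from their time-period means.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical balanced panel: n entities, T periods, true slope 2.0
rng = random.Random(0)
n, T = 30, 5
period_effect = [rng.gauss(0, 1) for _ in range(T)]    # plays the role of lambda_t
obs = []                                               # tuples (t, x, y)
for i in range(n):
    for t in range(T):
        x = rng.gauss(0, 1) + 0.5 * period_effect[t]   # X is correlated with the time effect
        y = 2.0 * x + period_effect[t] + rng.gauss(0, 1)
        obs.append((t, x, y))

# (a) Intercept, X, and T - 1 time indicators (the first period is omitted)
rows = [[1.0, x] + [1.0 if t == s else 0.0 for s in range(1, T)] for t, x, _ in obs]
beta_dummies = ols(rows, [y for _, _, y in obs])[1]

# (b) Deviate X and Y from their time-period means, then regress without an intercept
def period_mean(idx):
    return [sum(o[idx] for o in obs if o[0] == t) / n for t in range(T)]

xbar, ybar = period_mean(1), period_mean(2)
num = sum((x - xbar[t]) * (y - ybar[t]) for t, x, y in obs)
den = sum((x - xbar[t]) ** 2 for t, x, _ in obs)
beta_demeaned = num / den

print(beta_dummies, beta_demeaned)
```

The two slope estimates agree up to floating-point error, which is why software can avoid building the indicator variables explicitly.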

In the traffic fatalities regression, the time fixed effects specification allows us to 
eliminate bias arising from omitted variables like nationally introduced safety stan- 
dards that change over time but are the same across states in a given year. 


Both Entity and Time Fixed Effects 


If some omitted variables are constant over time but vary across states (such as cultural 
norms), while others are constant across states but vary over time (such as national 
safety standards), then it is appropriate to include both entity (state) and time effects. 
The combined entity and time fixed effects regression model is 


$Y_{it} = \beta_1 X_{it} + \alpha_i + \lambda_t + u_{it}$,   (10.19) 


where $\alpha_i$ is the entity fixed effect and $\lambda_t$ is the time fixed effect. This model can 
equivalently be represented using n − 1 entity binary indicators and T − 1 time 
binary indicators, along with an intercept: 


$Y_{it} = \beta_0 + \beta_1 X_{it} + \gamma_2 D2_i + \cdots + \gamma_n Dn_i + \delta_2 B2_t + \cdots + \delta_T BT_t + u_{it}$,   (10.20) 


where $\beta_0, \beta_1, \gamma_2, \ldots, \gamma_n$ and $\delta_2, \ldots, \delta_T$ are unknown coefficients. 

When there are additional observed “X” regressors, then these appear in 
Equations (10.19) and (10.20) as well. 

The combined entity and time fixed effects regression model eliminates omitted 
variables bias arising both from unobserved variables that are constant over time and 
from unobserved variables that are constant across states. 


Estimation. The time fixed effects model and the entity and time fixed effects model 
are both variants of the multiple regression model. Thus their coefficients can be 
estimated by OLS by including the additional time and entity binary variables. Alter- 
natively, in a balanced panel the coefficients on the X’s can be computed by first 
deviating Y and the X’s from their entity and time-period means and then by estimat- 
ing the multiple regression equation of deviated Y on the deviated X’s. This algo- 
rithm, which is commonly implemented in regression software, eliminates the need 
to construct the full set of binary indicators that appear in Equation (10.20). An 
equivalent approach is to deviate Y, the X’s, and the time indicators from their entity 
(but not time-period) means and to estimate k + T coefficients by multiple regres- 
sion of the deviated Y on the deviated X’s and the deviated time indicators. Finally, 
if T = 2, the entity and time fixed effects regression can be estimated using the 
“before and after” approach of Section 10.2, including the intercept in the regression. 
Thus the “before and after” regression reported in Equation (10.8), in which the 
change in FatalityRate from 1982 to 1988 is regressed on the change in BeerTax from 
1982 to 1988 including an intercept, provides the same estimate of the slope coeffi- 
cient as the OLS regression of FatalityRate on BeerTax, including entity and time 
fixed effects, estimated using data for the two years 1982 and 1988. 
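
The T = 2 equivalence described above can be illustrated with a small simulation. The sketch below is standard-library Python on made-up data; the number of entities, the true slope of −0.5, and the function name `two_way_demean` are assumptions for illustration. Deviating from entity and time-period means is implemented in the usual balanced-panel form, subtracting both means and adding back the overall mean.

```python
import random

rng = random.Random(1)
n = 40                                                    # hypothetical number of entities
x = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(n)]   # x[i][t], with T = 2
alpha = [rng.gauss(0, 1) for _ in range(n)]               # entity fixed effects
lam = [0.0, 0.7]                                          # time fixed effects
y = [[-0.5 * x[i][t] + alpha[i] + lam[t] + rng.gauss(0, 0.3) for t in range(2)]
     for i in range(n)]

def mean(v):
    return sum(v) / len(v)

def two_way_demean(z):
    """Deviate z[i][t] from its entity and time-period means (balanced panel)."""
    T = len(z[0])
    ent = [mean(row) for row in z]
    per = [mean([row[t] for row in z]) for t in range(T)]
    grand = mean(ent)
    return [[z[i][t] - ent[i] - per[t] + grand for t in range(T)] for i in range(len(z))]

# (a) Entity and time fixed effects via two-way demeaning
xd, yd = two_way_demean(x), two_way_demean(y)
beta_fe = (sum(xd[i][t] * yd[i][t] for i in range(n) for t in range(2))
           / sum(xd[i][t] ** 2 for i in range(n) for t in range(2)))

# (b) "Before and after": regress the change in Y on the change in X, with an intercept
dx = [x[i][1] - x[i][0] for i in range(n)]
dy = [y[i][1] - y[i][0] for i in range(n)]
mdx, mdy = mean(dx), mean(dy)
beta_diff = (sum((a - mdx) * (b - mdy) for a, b in zip(dx, dy))
             / sum((a - mdx) ** 2 for a in dx))

print(beta_fe, beta_diff)
```

The intercept in the differenced regression plays the role of the time effect; dropping it would break the equivalence whenever the two periods have different intercepts.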


Application to traffic deaths. Adding time effects to the state fixed effects regres- 
sion results in the OLS estimate of the regression line: 


$\widehat{FatalityRate} = -0.64\,BeerTax + \text{State Fixed Effects} + \text{Time Fixed Effects}$.   (10.21) 
                          (0.36) 


This specification includes the beer tax, 47 state binary variables (state fixed effects), 
6 single-year binary variables (time fixed effects), and an intercept, so this regression 
actually has 1 + 47 + 6 + 1 = 55 right-hand variables! The coefficients on the time 
and state binary variables and the intercept are not reported because they are not of 
primary interest. 

Including time effects has little impact on the coefficient on the real beer tax 
[compare Equations (10.15) and (10.21)]. Although this coefficient is less precisely 
estimated when time effects are included, it is still significant at the 10%, but not the 
5%, significance level (t = −0.64/0.36 = −1.78). 
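
The significance statement can be reproduced from the reported coefficient and clustered standard error. The following is a minimal sketch that assumes the large-sample standard normal distribution for the t-statistic; the variable names are illustrative.

```python
import math

coef, se = -0.64, 0.36                     # Equation (10.21)
t_stat = coef / se

# Two-sided p-value under the standard normal: 2 * (1 - Phi(|t|))
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

# 95% confidence interval using the 1.96 critical value
ci_95 = (coef - 1.96 * se, coef + 1.96 * se)

print(round(t_stat, 2), p_value, ci_95)
```

The p-value lies between 0.05 and 0.10, and the 95% interval contains 0, which matches the statement that the coefficient is significant at the 10% level but not at the 5% level.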

This estimated relationship between the real beer tax and traffic fatalities is 
immune to omitted variable bias from variables that are constant either over time or 
across states. However, many important determinants of traffic deaths do not fall into 
this category, so this specification could still be subject to omitted variable bias. 
Section 10.6 therefore undertakes a more complete empirical examination of the 
effect of the beer tax and of laws aimed directly at eliminating drunk driving, controlling 
for a variety of factors. Before turning to that study, we first discuss the assumptions 
underlying panel data regression and the construction of standard errors for 
fixed effects estimators. 


10.5 The Fixed Effects Regression Assumptions and 
Standard Errors for Fixed Effects Regression 


In panel data, the regression error can be correlated over time within an entity. Like 
heteroskedasticity, this correlation does not introduce bias into the fixed effects esti- 
mator, but it affects the variance of the fixed effects estimator, and therefore it affects 
how one computes standard errors. The standard errors for fixed effects regressions 
reported in this chapter are so-called clustered standard errors, which are robust both 
to heteroskedasticity and to correlation over time within an entity. When there are 
many entities (when n is large), hypothesis tests and confidence intervals can be 
computed using the usual large-sample normal and F critical values. 

This section describes clustered standard errors. We begin with the fixed effects 
regression assumptions, which extend the least squares regression assumptions for 
causal inference to panel data; under these assumptions, the fixed effects estimator 
is consistent and asymptotically normally distributed when n is large. To keep the 
notation as simple as possible, this section focuses on the entity fixed effects regression 
model of Section 10.3, in which there are no time effects. 


The Fixed Effects Regression Assumptions 


The four fixed effects regression assumptions are summarized in Key Concept 10.3. 
These assumptions extend the four least squares assumptions for causal inference, 
stated for cross-sectional data in Key Concept 6.4, to panel data. 

The first assumption is that the error term has conditional mean 0 given all T 
values of X for that entity. This assumption plays the same role as the first least 
squares assumption for cross-sectional data in Key Concept 6.4 and implies that 
there is no omitted variable bias. The requirement that the conditional mean of $u_{it}$ 
not depend on any of the values of X for that entity (past, present, or future) adds 
an important subtlety beyond the first least squares assumption for cross-sectional 
data. This assumption is violated if current $u_{it}$ is correlated with past, present, or 
future values of X. 

The second assumption is that the variables for one entity are distributed identically 
to, but independently of, the variables for another entity; that is, the variables 
are i.i.d. across entities for i = 1, ..., n. Like the second least squares assumption in 
Key Concept 6.4, the second assumption for fixed effects regression holds if entities 
are selected by simple random sampling from the population. 


Key Concept 10.3: The Fixed Effects Regression Assumptions 

$Y_{it} = \beta_1 X_{it} + \alpha_i + u_{it}$, i = 1, ..., n, t = 1, ..., T, 

where $\beta_1$ is the causal effect on Y of X and 

1. $u_{it}$ has conditional mean 0: $E(u_{it} \mid X_{i1}, X_{i2}, \ldots, X_{iT}, \alpha_i) = 0$. 
2. $(X_{i1}, X_{i2}, \ldots, X_{iT}, u_{i1}, u_{i2}, \ldots, u_{iT})$, i = 1, ..., n, are i.i.d. draws from their 
joint distribution. 
3. Large outliers are unlikely: $(X_{it}, u_{it})$ have nonzero finite fourth moments. 
4. There is no perfect multicollinearity. 

For multiple regressors, $X_{it}$ should be replaced by the full list $X_{1,it}, X_{2,it}, \ldots, X_{k,it}$. 

The third and fourth assumptions for fixed effects regression are analogous to 
the third and fourth least squares assumptions for cross-sectional data in Key 
Concept 6.4. 

Under the least squares assumptions for panel data in Key Concept 10.3, the 
fixed effects estimator is consistent and is normally distributed when n is large. The 
details are discussed in Appendix 10.2. 

An important difference between the panel data assumptions in Key Concept 
10.3 and the assumptions for cross-sectional data in Key Concept 6.4 is assumption 2. 
The cross-sectional counterpart of assumption 2 holds that each observation is inde- 
pendent, which arises under simple random sampling. In contrast, assumption 2 for 
panel data holds that the variables are independent across entities but makes no such 
restriction within an entity. For example, assumption 2 allows $X_{it}$ to be correlated 
over time within an entity. 

If $X_{it}$ is correlated with $X_{is}$ for different values of s and t (that is, if $X_{it}$ is 
correlated over time for a given entity), then $X_{it}$ is said to be autocorrelated (correlated 
with itself, at different dates) or serially correlated. Autocorrelation is a pervasive 
feature of time series data: What happens one year tends to be correlated with what 
happens the next year. In the traffic fatality example, $X_{it}$, the beer tax in state i in 
year t, is autocorrelated: Most of the time the legislature does not change the beer 
tax, so if it is high one year relative to its mean value for state i, it will tend to be high 
the next year, too. Similarly, it is possible to think of reasons why $u_{it}$ would be 
autocorrelated. Recall that $u_{it}$ consists of time-varying factors that are determinants of $Y_{it}$ 
but are not included as regressors, and some of these omitted factors might be 
autocorrelated. For example, a downturn in the local economy might produce layoffs and 
diminish commuting traffic, thus reducing traffic fatalities for 2 or more years. 
Similarly, a major road improvement project might reduce traffic accidents not only in the 
year of completion but also in future years. Such omitted factors, which persist over 
multiple years, produce autocorrelated regression errors. Not all omitted factors will 
produce autocorrelation in $u_{it}$; for example, severe winter driving conditions plausibly 
affect fatalities, but if winter weather conditions for a given state are independently 
distributed from one year to the next, then this component of the error term 
would be serially uncorrelated. In general, though, as long as some omitted factors 
are autocorrelated, then $u_{it}$ will be autocorrelated. 
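
A small simulation can make the last point concrete. The sketch below is standard-library Python on made-up data; the AR(1) process with coefficient 0.8, the series length, and the names `persistent` and `weather` are illustrative assumptions. It follows one hypothetical entity over many periods and builds an error term from one persistent component and one serially independent component.

```python
import random

def lag1_autocorr(series):
    """Sample correlation between a series and its own one-period lag."""
    m = sum(series) / len(series)
    num = sum((a - m) * (b - m) for a, b in zip(series[:-1], series[1:]))
    den = sum((a - m) ** 2 for a in series)
    return num / den

rng = random.Random(42)
N, rho = 20000, 0.8

# Persistent omitted factor (for example, local economic conditions): AR(1), unit variance
persistent = [rng.gauss(0, 1)]
for _ in range(N - 1):
    persistent.append(rho * persistent[-1] + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1))

# Serially independent omitted factor (for example, winter weather)
weather = [rng.gauss(0, 1) for _ in range(N)]

# The regression error contains both components
u = [p + w for p, w in zip(persistent, weather)]

print(lag1_autocorr(persistent), lag1_autocorr(weather), lag1_autocorr(u))
```

The independent component has a lag-1 autocorrelation near 0, but the combined error inherits positive autocorrelation from the persistent component.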


Standard Errors for Fixed Effects Regression 


If the regression errors are autocorrelated, then the usual heteroskedasticity-robust 
standard error formula for cross-section regression [Equations (5.3) and (5.4)] is not 
valid. One way to see this is to draw an analogy to heteroskedasticity. In a regression 
with cross-sectional data, if the errors are heteroskedastic, then (as discussed in 
Section 5.4) the homoskedasticity-only standard errors are not valid because they 
were derived under the false assumption of homoskedasticity. Similarly, if the errors 
are autocorrelated, then the usual standard errors will not be valid because they were 
derived under the false assumption of no serial correlation. 

Standard errors that are valid if $u_{it}$ is potentially heteroskedastic and potentially 
correlated over time within an entity are referred to as heteroskedasticity- and 
autocorrelation-robust (HAR) standard errors. The standard errors used in this 
chapter are one type of HAR standard errors, clustered standard errors. The term 
clustered arises because these standard errors allow the regression errors to have an 
arbitrary correlation within a cluster, or grouping, but assume that the regression 
errors are uncorrelated across clusters. In the context of panel data, each cluster 
consists of an entity. Thus clustered standard errors allow for heteroskedasticity and 
for arbitrary autocorrelation within an entity but treat the errors as uncorrelated 
across entities. That is, clustered standard errors allow for heteroskedasticity and 
autocorrelation in a way that is consistent with the second fixed effects regression 
assumption in Key Concept 10.3. 

Like heteroskedasticity-robust standard errors in regression with cross-sectional 
data, clustered standard errors are valid whether or not there is heteroskedasticity, 
autocorrelation, or both. If the number of entities n is large, inference using clustered 
standard errors can proceed using the usual large-sample normal critical values for 
t-statistics and $F_{q,\infty}$ critical values for F-statistics testing q restrictions. 

In practice, there can be a large difference between clustered standard errors 
and standard errors that do not allow for autocorrelation of $u_{it}$. For example, the 
usual (cross-sectional data) heteroskedasticity-robust standard error for the BeerTax 
coefficient in Equation (10.21) is 0.25, substantially smaller than the clustered 
standard error, 0.36, and the respective t-statistics testing $\beta_1 = 0$ are −2.51 and 
−1.78. The reason we report the clustered standard error is that it allows for serial 
correlation of $u_{it}$ within an entity, whereas the usual heteroskedasticity-robust 
standard error does not. The formula for clustered standard errors is given in 
Appendix 10.2. 
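
The gap between the two kinds of standard errors can be reproduced on simulated data. The sketch below is standard-library Python and is not the formula from Appendix 10.2: it uses a common textbook form of the cluster-robust variance for a single entity-demeaned regressor, with no finite-sample adjustment, and the panel dimensions, the AR(1) coefficient of 0.9, and the true slope of −0.6 are all assumptions for illustration.

```python
import random

rng = random.Random(7)
n, T, rho, beta = 500, 7, 0.9, -0.6      # hypothetical panel dimensions and true slope

def ar1(length):
    """Stationary AR(1) series with unit variance and autocorrelation rho."""
    z = [rng.gauss(0, 1)]
    for _ in range(length - 1):
        z.append(rho * z[-1] + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1))
    return z

def demean(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

# Entity-demeaned X and Y; both X and the error u are autocorrelated within an entity
xs, ys = [], []
for _ in range(n):
    x, u = ar1(T), ar1(T)
    alpha = rng.gauss(0, 1)                          # entity fixed effect
    y = [beta * a + alpha + e for a, e in zip(x, u)]
    xs.append(demean(x))
    ys.append(demean(y))

sxx = sum(a * a for x in xs for a in x)
b_hat = sum(a * c for x, y in zip(xs, ys) for a, c in zip(x, y)) / sxx
resid = [[c - b_hat * a for a, c in zip(x, y)] for x, y in zip(xs, ys)]

# Heteroskedasticity-robust: square each x * residual term, then sum over observations
v_hr = sum((a * e) ** 2 for x, r in zip(xs, resid) for a, e in zip(x, r)) / sxx ** 2

# Clustered: sum x * residual within each entity first, then square the entity totals
v_cl = sum(sum(a * e for a, e in zip(x, r)) ** 2 for x, r in zip(xs, resid)) / sxx ** 2

se_hr, se_cl = v_hr ** 0.5, v_cl ** 0.5
print(b_hat, se_hr, se_cl)
```

The only difference between the two variance estimates is whether the products are summed within an entity before squaring. Summing first keeps the cross-period terms, which is how the clustered version allows for arbitrary correlation over time within an entity.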


10.6 Drunk Driving Laws and Traffic Deaths 


Alcohol taxes are only one way to discourage drinking and driving. States differ in 
their punishments for drunk driving, and a state that cracks down on drunk driving 
could do so by toughening driving laws as well as raising taxes. If so, omitting these 
laws could produce omitted variable bias in the OLS estimator of the effect of real 
beer taxes on traffic fatalities, even in regressions with state and time fixed effects. In 
addition, because vehicle use depends in part on whether drivers have jobs and 
because tax changes can reflect economic conditions (a state budget deficit can lead 
to tax hikes), omitting state economic conditions also could result in omitted variable 
bias. In this section, we therefore extend the preceding analysis of traffic fatalities to 
include other driving laws and economic conditions. 

The results are summarized in Table 10.1. The format of the table is the same as 
that of the tables of regression results in Chapters 7 through 9: Each column reports 
a different regression, and each row reports a coefficient estimate and standard error, 
a 95% confidence interval for the coefficients on the policy variables of interest, an 
F-statistic and p-value, or other information about the regression. 

Column (1) in Table 10.1 presents results for the OLS regression of the fatality 
rate on the real beer tax without state and time fixed effects. As in the cross-sectional 
regressions for 1982 and 1988 [Equations (10.2) and (10.3)], the coefficient on the 
real beer tax is positive (0.36): According to this estimate, increasing beer taxes 
increases traffic fatalities! However, the regression in column (2) [reported previ- 
ously as Equation (10.15)], which includes state fixed effects, suggests that the posi- 
tive coefficient in regression (1) is the result of omitted variable bias (the coefficient 
on the real beer tax is −0.66). The regression R² jumps from 0.091 to 0.889 when fixed 
effects are included; evidently, the state fixed effects account for a large amount of 
the variation in the data. 

Little changes when time effects are added, as reported in column (3) [reported 
previously as Equation (10.21)], except that the beer tax coefficient is now estimated 
less precisely. The results in columns (1) through (3) are consistent with the omitted 
fixed factors (historical and cultural factors, general road conditions, population 
density, attitudes toward drinking and driving, and so forth) being important 
determinants of the variation in traffic fatalities across states. 

The next four regressions in Table 10.1 include additional potential determinants 
of fatality rates along with state and time effects. The base specification, reported in 
column (4), includes variables related to drunk driving laws plus variables that control 
for the amount of driving and overall state economic conditions. The first legal 
variables are the minimum legal drinking age, represented by three binary variables 
for a minimum legal drinking age of 18, 19, and 20 (so the omitted group is a minimum 
legal drinking age of 21 or older). The other legal variable is the punishment 
associated with the first conviction for driving under the influence of alcohol, either 
mandatory jail time or mandatory community service (the omitted group is less 
severe punishment). The three measures of driving and economic conditions are 
average vehicle miles per driver, the unemployment rate, and the logarithm of real 
(1988 dollars) personal income per capita (using the logarithm of income permits the 
coefficient to be interpreted in terms of percentage changes of income; see 
Section 8.2). The final regression in Table 10.1 follows the “before and after” approach 
of Section 10.2 and uses only data from 1982 and 1988; thus regression (7) extends 
the regression in Equation (10.8) to include the additional regressors. 


Table 10.1 Regression Analysis of the Effect of Drunk Driving Laws on Traffic Deaths 

Dependent variable: traffic fatality rate (deaths per 10,000). 

Regressor                     (1)             (2)             (3)             (4)             (5)             (6)             (7)
Beer tax                      0.36            −0.66           −0.64           −0.45           −0.69           −0.46           −0.93
                              (0.05)          (0.29)          (0.36)          (0.30)          (0.35)          (0.31)          (0.34)
                              [0.26, 0.46]    [−1.23, −0.09]  [−1.35, 0.07]   [−1.04, 0.14]   [−1.38, 0.00]   [−1.07, 0.15]   [−1.60, −0.26]
Drinking age 18                                                               0.03            −0.01                           0.04
                                                                              (0.07)          (0.08)                          (0.10)
                                                                              [−0.11, 0.17]   [−0.17, 0.15]                   [−0.16, 0.24]
Drinking age 19                                                               −0.02           −0.08                           −0.07
                                                                              (0.05)          (0.07)                          (0.10)
                                                                              [−0.12, 0.08]   [−0.21, 0.06]                   [−0.26, 0.13]
Drinking age 20                                                               0.03            −0.10                           −0.11
                                                                              (0.05)          (0.06)                          (0.13)
                                                                              [−0.07, 0.13]   [−0.21, 0.01]                   [−0.36, 0.14]
Drinking age                                                                                                  0.00
                                                                                                              (0.02)
                                                                                                              [−0.05, 0.04]
Mandatory jail or                                                             0.04            0.09            0.04            0.09
community service?                                                            (0.10)          (0.11)          (0.10)          (0.16)
                                                                              [−0.17, 0.25]   [−0.14, 0.31]   [−0.17, 0.25]   [−0.24, 0.42]
Average vehicle miles                                                         0.008           0.017           0.009           0.124
per driver                                                                    (0.007)         (0.011)         (0.007)         (0.049)
Unemployment rate                                                             −0.063                          −0.063          −0.091
                                                                              (0.013)                         (0.013)         (0.021)
Real income per capita                                                        1.82                            1.79            1.00
(logarithm)                                                                   (0.64)                          (0.64)          (0.68)
Years                         1982-88         1982-88         1982-88         1982-88         1982-88         1982-88         1982 & 1988 only
State effects?                no              yes             yes             yes             yes             yes             yes
Time effects?                 no              no              yes             yes             yes             yes             yes
Clustered standard errors?    no              yes             yes             yes             yes             yes             yes

F-Statistics and p-Values Testing Exclusion of Groups of Variables
Time effects = 0                                              4.22            10.12           3.48            10.28           37.49
                                                              (0.002)         (<0.001)        (0.006)         (<0.001)        (<0.001)
Drinking age coefficients = 0                                                 0.35            1.41                            0.42
                                                                              (0.786)         (0.253)                         (0.738)
Unemployment rate,                                                            29.62                           31.96           25.20
income per capita = 0                                                         (<0.001)                        (<0.001)        (<0.001)
R²                            0.091           0.889           0.891           0.926           0.893           0.926           0.899

These regressions were estimated using panel data for 48 U.S. states. Regressions (1) through (6) use data for all years 1982 
to 1988, and regression (7) uses data from 1982 and 1988 only. The data set is described in Appendix 10.1. Standard errors 
are given in parentheses under the coefficients, 95% confidence intervals are given in square brackets under the coefficients, 
and p-values are given in parentheses under the F-statistics. 


The regression in column (4) has four interesting results. 


1. Including the additional variables reduces the estimated effect of the beer 
tax from −0.64 in column (3) to −0.45 in column (4). One way to evaluate 
the magnitude of this coefficient is to imagine a state with an average real 
beer tax doubling its tax; because the average real beer tax in these data is 
approximately $0.50 per case (in 1988 dollars), this entails increasing the tax 
by $0.50 per case. The estimated effect of a $0.50 increase in the beer tax is to 
decrease the expected fatality rate by 0.45 × 0.50 = 0.23 deaths per 10,000. 
This estimated effect is large: Because the average fatality rate is 2 deaths per 
10,000, a reduction of 0.23 corresponds to reducing traffic deaths by nearly 
one-eighth. This said, the estimate is quite imprecise: Because the standard 
error on this coefficient is 0.30, the 95% confidence interval for this effect 
is −0.45 × 0.50 ± 1.96 × 0.30 × 0.50 = (−0.52, 0.08). This wide 95% 
confidence interval includes 0, so the hypothesis that the beer tax has no effect 
cannot be rejected at the 5% significance level. 

2. The minimum legal drinking age is precisely estimated to have a small effect 
on traffic fatalities. According to the regression in column (4), the 95% 
confidence interval for the increase in the fatality rate in a state with a minimum 
legal drinking age of 18, relative to age 21, is (−0.11, 0.17). The joint 
hypothesis that the coefficients on the minimum legal drinking age variables 
are 0 cannot be rejected at the 10% significance level: The F-statistic testing 
the joint hypothesis that the three coefficients are 0 is 0.35, with a p-value 
of 0.786. 

3. The coefficient on the first offense punishment variable is also estimated to 
be small and is not significantly different from 0 at the 10% significance level. 

4. The economic variables have considerable explanatory power for traffic 
fatalities. High unemployment rates are associated with fewer fatalities: An increase 
in the unemployment rate by 1 percentage point is estimated to reduce traffic 
fatalities by 0.063 deaths per 10,000. Similarly, high values of real per capita 
income are associated with high fatalities: The coefficient is 1.82, so a 1% 
increase in real per capita income is associated with an increase in traffic 
fatalities of 0.0182 deaths per 10,000 (see case I in Key Concept 8.2 for interpretation 
of this coefficient). According to these estimates, good economic conditions 
are associated with higher fatalities, perhaps because of increased traffic 
density when the unemployment rate is low or greater alcohol consumption when 
income is high. The two economic variables are jointly significant at the 0.1% 
significance level (the F-statistic is 29.62). 
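
The arithmetic behind the first and fourth results can be laid out step by step. This sketch takes the rounded values reported in column (4) as inputs, so the last digit of the interval can differ slightly from the one quoted in the text, which may rest on unrounded estimates. The variable names are illustrative.

```python
coef, se = -0.45, 0.30          # beer tax coefficient and clustered standard error, column (4)
tax_increase = 0.50             # dollars per case (1988 dollars)
mean_fatality_rate = 2.0        # deaths per 10,000, as stated in the text

# Estimated effect of the tax increase and its 95% confidence interval
effect = coef * tax_increase
half_width = 1.96 * se * tax_increase
ci = (effect - half_width, effect + half_width)

# Size of the effect relative to the average fatality rate
share_of_mean = abs(effect) / mean_fatality_rate

# Linear-log specification: a 1% rise in income changes Y by about 0.01 * coefficient
income_effect = 0.01 * 1.82

print(effect, ci, share_of_mean, income_effect)
```

With these rounded inputs the interval is roughly (−0.52, 0.07); it contains 0, consistent with the conclusion that the hypothesis of no beer tax effect cannot be rejected at the 5% level.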


Columns (5) through (7) of Table 10.1 report regressions that check the sensitiv- 
ity of these conclusions to changes in the base specification. The regression in column 
(5) drops the variables that control for economic conditions. The result is an increase 
in the estimated effect of the real beer tax, which becomes significant at the 5% level, 
but there is no appreciable change in the other coefficients. The sensitivity of the 
estimated beer tax coefficient to including the economic variables, combined with the 
statistical significance of the coefficients on those variables in column (4), indicates 
that the economic variables should remain in the base specification. The regression 
in column (6) shows that the results in column (4) are not sensitive to changing the 
functional form when the three drinking age indicator variables are replaced by 
the drinking age itself. When the coefficients are estimated using the changes of the 
variables from 1982 to 1988 [column (7)], as in Section 10.2, the findings from column 
(4) are largely unchanged except that the coefficient on the beer tax is larger and is 
significant at the 1% level. 

The strength of this analysis is that including state and time fixed effects miti- 
gates the threat of omitted variable bias arising from unobserved variables that either 
do not change over time (like cultural attitudes toward drinking and driving) or do 
not vary across states (like safety innovations). As always, however, it is important to 
think about possible threats to validity. One potential source of omitted variable bias 
is that the measure of alcohol taxes used here, the real tax on beer, could move with 
other alcohol taxes, which suggests interpreting the results as pertaining more broadly 
than just to beer. A subtler possibility is that hikes in the real beer tax could be asso- 
ciated with public education campaigns. If so, changes in the real beer tax could pick 
up the effect of a broader campaign to reduce drunk driving. 

Taken together, these results present a provocative picture of measures to control 
drunk driving and traffic fatalities. According to these estimates, neither stiff 
punishments nor increases in the minimum legal drinking age have important effects 
on fatalities. In contrast, there is evidence that increasing alcohol taxes, as measured 
by the real tax on beer, does reduce traffic deaths, presumably through reduced alcohol 
consumption. The imprecision of the estimated beer tax coefficient means, however, 
that we should be cautious about drawing policy conclusions from this analysis 
and that additional research is warranted.² 


²For further analysis of these data, see Ruhm (1996). A meta-analysis by Wagenaar, Salois, and Komro 
(2009) of 112 studies of the effect of alcohol prices and taxes on consumption found elasticities of −0.46 
for beer, −0.69 for wine, and −0.80 for spirits and concluded that alcohol taxes have large effects on 
reducing consumption relative to other programs. Carpenter and Dobkin (2011) provide evidence that, 
in contrast to the findings here, raising the minimum legal drinking age substantially reduces fatalities 
among drivers in the affected age range, especially at night, although they do not control for the other 
variables in Table 10.1. To learn more about drunk driving and alcohol and about the economics of alcohol 
more generally, also see Cook and Moore (2000), Chaloupka, Grossman, and Saffer (2002), Young and 
Bielinska-Kwapisz (2006), and Dang (2008). 


Conclusion 


This chapter showed how multiple observations over time on the same entity can be 
used to control for unobserved omitted variables that differ across entities but are 
constant over time. The key insight is that if the unobserved variable does not change 
over time, then any changes in the dependent variable must be due to influences 
other than these fixed characteristics. If cultural attitudes toward drinking and driv- 
ing do not change appreciably over 7 years within a state, then explanations for 
changes in the traffic fatality rate over those 7 years must lie elsewhere. 

To exploit this insight, you need data in which the same entity is observed at two 
or more time periods; that is, you need panel data. With panel data, the multiple 
regression model of Part II can be extended to include a full set of entity binary 
variables; this is the fixed effects regression model, which can be estimated by OLS. 
A twist on the fixed effects regression model is to include time fixed effects, which 
control for unobserved variables that change over time but are constant across enti- 
ties. Both entity and time fixed effects can be included in the regression to control 
for variables that vary across entities but are constant over time and for variables that 
vary over time but are constant across entities. 

Despite these virtues, entity and time fixed effects regression cannot control for 
omitted variables that vary both across entities and over time. And, obviously, panel 
data methods require panel data, which often are not available. Thus there remains 
a need for a method that can eliminate the influence of unobserved omitted variables 
when panel data methods cannot do the job. A powerful and general method for 
doing so is instrumental variables regression, the topic of Chapter 12. 


Summary 


1. Panel data consist of observations on multiple (n) entities (states, firms, people, 
and so forth), where each entity is observed at two or more time periods (T). 

2. Regression with entity fixed effects controls for unobserved variables that dif- 
fer from one entity to the next but remain constant over time. 

3. When there are two time periods, fixed effects regression can be estimated by 
a “before and after” regression of the change in Y from the first period to the 
second on the corresponding change in X. 

4. Entity fixed effects regression can be estimated by including binary variables 
for n - 1 entities plus the observable independent variables (the X’s) and an 
intercept. 

5. Time fixed effects control for unobserved variables that are the same across 
entities but vary over time. 

6. A regression with time and entity fixed effects can be estimated by including 
binary variables for n - 1 entities and binary variables for T - 1 time periods 
plus the X’s and an intercept. 

7. In panel data, variables are typically autocorrelated, that is, correlated over 
time within an entity. Standard errors need to allow both for this autocorrelation 
and for potential heteroskedasticity, and one way to do so is to use 
clustered standard errors. 
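
The idea in points 2 and 4 can be illustrated with a small simulation. The sketch below is not taken from the text: the data are simulated, and the function names `entity_demeaned_slope` and `pooled_slope` are hypothetical. It assumes one regressor and uses entity demeaning, which yields the same slope as including entity binary variables.

```python
import random

def entity_demeaned_slope(x, y):
    """Slope from regressing entity-demeaned Y on entity-demeaned X.

    x and y are lists with one inner list per entity; each inner list
    holds that entity's observations over time.
    """
    num = 0.0
    den = 0.0
    for xi, yi in zip(x, y):
        x_bar = sum(xi) / len(xi)   # entity mean of X
        y_bar = sum(yi) / len(yi)   # entity mean of Y
        for xit, yit in zip(xi, yi):
            num += (xit - x_bar) * (yit - y_bar)
            den += (xit - x_bar) ** 2
    return num / den

def pooled_slope(x, y):
    """OLS slope that ignores the panel structure (no fixed effects)."""
    fx = [v for xi in x for v in xi]
    fy = [v for yi in y for v in yi]
    mx = sum(fx) / len(fx)
    my = sum(fy) / len(fy)
    num = sum((a - mx) * (b - my) for a, b in zip(fx, fy))
    den = sum((a - mx) ** 2 for a in fx)
    return num / den

# Simulated panel (hypothetical numbers): alpha_i is an unobserved entity
# effect that raises both X and Y, so pooled OLS has omitted variable bias.
rng = random.Random(0)
n, T, beta1 = 200, 5, -0.5
x, y = [], []
for _ in range(n):
    alpha_i = rng.gauss(0, 1)
    xi = [alpha_i + rng.gauss(0, 1) for _ in range(T)]
    yi = [beta1 * xit + 2.0 * alpha_i + rng.gauss(0, 0.5) for xit in xi]
    x.append(xi)
    y.append(yi)

fe_slope = entity_demeaned_slope(x, y)
ols_slope = pooled_slope(x, y)
print(f"true beta1 = {beta1}, fixed effects = {fe_slope:.3f}, pooled OLS = {ols_slope:.3f}")
```

In this simulated design the pooled slope is pulled away from the true coefficient by the omitted entity effect, while the demeaned slope stays close to it.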


Key Terms 

panel data (362) 
balanced panel (362) 
unbalanced panel (362) 
fixed effects regression model (368) 
entity fixed effects (368) 
time fixed effects regression model (372) 
time fixed effects (372) 
entity and time fixed effects regression model (372) 
autocorrelated (375) 
serially correlated (375) 
heteroskedasticity- and autocorrelation-robust (HAR) standard errors (376) 
clustered standard errors (376) 


Review the Concepts 


10.1 What is meant by panel data? What is the advantage of using such data to 
make statistical and economic inferences? 


10.2 A researcher is using a panel data set on n = 1000 workers over T = 10 years 
(from 2008 through 2017) that contains the workers’ earnings, sex, educa- 
tion, and age. The researcher is interested in the effect of education on earn- 
ings. Give some examples of unobserved person-specific variables that are 
correlated with both education and earnings. Can you think of examples of 
time-specific variables that might be correlated with education and earnings? 
How would you control for these person-specific and time-specific effects in 
a panel data regression? 


10.3 Can the regression that you suggested in response to Question 10.2 be used 
to estimate the effect of a worker’s sex on his or her earnings? Can that 
regression be used to estimate the effect of the national unemployment rate 
on an individual’s earnings? Explain. 


10.4 In the context of the regression you suggested for Question 10.2, explain why 
the regression error for a given individual might be serially correlated. 


Exercises 


10.1 This exercise refers to the drunk driving panel data regression summarized in 
Table 10.1. 


a. New Jersey has a population of 8.85 million people. Suppose New Jersey 
increased the tax on a case of beer by $2 (in 1988 dollars). Use the 
results in column (5) to predict the number of lives that would be saved 
over the next year. Construct a 99% confidence interval for your answer. 

b. The drinking age in New Jersey is 21. Suppose that New Jersey lowered 
its drinking age to 19. Use the results in column (5) to predict the change 
in the number of traffic fatalities in the next year. Construct a 95% 
confidence interval for your answer. 

c. Suppose real income per capita in New Jersey increases by 3% in the 
next year. Use the results in column (6) to predict the change in the 
number of traffic fatalities in the next year. Construct a 95% confidence 
interval for your answer. 

d. How should standard errors be clustered in the regressions in columns 
(2) through (7)? 

e. How should minimum drinking age be included in the regressions? 
Should it enter as a continuous variable or as a series of indicator 
variables? Be specific about the information you use to assess this question. 


10.2 Consider the binary variable version of the fixed effects model in Equation 
(10.11) except with an additional regressor, $D1_i$; that is, let 

$$Y_{it} = \beta_0 + \beta_1 X_{it} + \gamma_1 D1_i + \gamma_2 D2_i + \cdots + \gamma_n Dn_i + u_{it}.$$

a. Suppose that $n = 3$. Show that the binary regressors and the “constant” 
regressor are perfectly multicollinear; that is, express one of the variables 
$D1_i$, $D2_i$, $D3_i$, and $X_{0,it}$ as a perfect linear function of the others, where 
$X_{0,it} = 1$ for all $i$, $t$. 

b. Show the result in (a) for general $n$. 

c. What will happen if you try to estimate the coefficients of the regression 
by OLS? 


10.3 Section 9.2 gave a list of five potential threats to the internal validity of a 
regression study. Apply that list to the empirical analysis in Section 10.6 and 
thereby draw conclusions about its internal validity. 


10.4 Using the regression in Equation (10.11), what are the slope and intercept for 

a. Entity 1 in time period 1? 
b. Entity 1 in time period 3? 
c. Entity 3 in time period 1? 
d. Entity 3 in time period 3? 


10.5 Consider the model with a single regressor, $Y_{it} = \beta_1 X_{1,it} + \alpha_i + \lambda_t + u_{it}$. 
This model also can be written as 

$$Y_{it} = \beta_0 + \beta_1 X_{1,it} + \delta_2 B2_t + \cdots + \delta_T BT_t + \gamma_2 D2_i + \cdots + \gamma_n Dn_i + u_{it},$$

where $B2_t = 1$ if $t = 2$ and 0 otherwise, $D2_i = 1$ if $i = 2$ and 0 otherwise, and 
so forth. How are the coefficients $(\beta_0, \delta_2, \ldots, \delta_T, \gamma_2, \ldots, \gamma_n)$ related to the 
coefficients $(\alpha_1, \ldots, \alpha_n, \lambda_1, \ldots, \lambda_T)$? 


10.6 Do the fixed effects regression assumptions in Key Concept 10.3 imply that 
$\mathrm{cov}(\tilde{v}_{it}, \tilde{v}_{is}) = 0$ for $t \neq s$ in Equation (10.28)? Explain. 


10.7 Suppose a researcher believes that the occurrence of natural disasters such 
as earthquakes leads to increased activity in the construction industry. He 
decides to collect province-level data on employment in the construction 
industry of an earthquake-prone country, like Japan, and regress this 
variable on an indicator variable that equals 1 if an earthquake took place in that 
province in the last five years. 

a. Should the researcher include province fixed effects in order to control 
for location-specific characteristics of the labor market? 

b. What can the researcher do to control for location effects? 


10.8 Consider observations $(Y_{it}, X_{it})$ from the linear panel data model 

$$Y_{it} = X_{it}\beta_1 + \alpha_i + \lambda_i t + u_{it},$$

where $t = 1, \ldots, T$; $i = 1, \ldots, n$; and $\alpha_i + \lambda_i t$ is an unobserved entity-specific 
time trend. How would you estimate $\beta_1$? 


10.9 a. In the fixed effects regression model, are the fixed entity effects, $\alpha_i$, 
consistently estimated as $n \to \infty$ with $T$ fixed? (Hint: Analyze the 
model with no X’s: $Y_{it} = \alpha_i + u_{it}$.) 

b. If $n$ is large (say, $n = 2000$) but $T$ is small (say, $T = 4$), do you think that 
the estimated values of $\alpha_i$ are approximately normally distributed? Why 
or why not? (Hint: Analyze the model $Y_{it} = \alpha_i + u_{it}$.) 


10.10 A researcher wants to estimate the determinants of annual earnings: age, 
gender, schooling, union status, occupation, and sector of employment. He 
has been told that if he collects panel data on a large number of randomly 
chosen individuals over time, he will be able to regress annual earnings on 
these determinant variables while using fixed effects to control for 
individual-specific time-invariant characteristics. What estimation problems is he likely 
to run into if he uses this strategy? 


10.11 Let $\hat{\beta}_1^{DM}$ denote the entity-demeaned estimator given in Equation (10.22), 
and let $\hat{\beta}_1^{BA}$ denote the “before and after” estimator without an intercept, so 
that $\hat{\beta}_1^{BA} = \left[\sum_{i=1}^{n}(X_{i2} - X_{i1})(Y_{i2} - Y_{i1})\right] / \left[\sum_{i=1}^{n}(X_{i2} - X_{i1})^2\right]$. Show that, if 
$T = 2$, $\hat{\beta}_1^{DM} = \hat{\beta}_1^{BA}$. [Hint: Use the definition of $\tilde{X}_{it}$ before Equation (10.22) 
to show that $\tilde{X}_{i1} = -\tfrac{1}{2}(X_{i2} - X_{i1})$ and $\tilde{X}_{i2} = \tfrac{1}{2}(X_{i2} - X_{i1})$.] 


Empirical Exercises 


E10.1 Some U.S. states have enacted laws that allow citizens to carry concealed 
weapons. These laws are known as “shall-issue” laws because they instruct 
local authorities to issue a concealed weapons permit to all applicants who 
are citizens, are mentally competent, and have not been convicted of a felony. 
(Some states have some additional restrictions.) Proponents argue that if 
more people carry concealed weapons, crime will decline because criminals 
will be deterred from attacking other people. Opponents argue that crime 
will increase because of accidental or spontaneous use of the weapons. In this 
exercise, you will analyze the effect of concealed weapons laws on violent 
crimes. On the text website, http://www.pearsonglobaleditions.com, you will 
find the data file Guns, which contains a balanced panel of data from the 50 
U.S. states plus the District of Columbia for the years 1977 through 1999.³ A 
detailed description is given in Guns_Description, available on the website. 


a. Estimate (1) a regression of ln(vio) against shall and (2) a regression of 
ln(vio) against shall, incarc_rate, density, avginc, pop, pb1064, pw1064, 
and pm1029. 


i. Interpret the coefficient on shall in regression (2). Is this estimate 
large or small in a real-world sense? 


ii. Does adding the control variables in regression (2) change the 
estimated effect of a shall-issue law in regression (1) as measured 
by statistical significance? As measured by the real-world significance 
of the estimated coefficient? 


iii. Suggest a variable that varies across states but plausibly varies 
little, or not at all, over time and that could cause omitted variable 
bias in regression (2). 

b. Do the results change when you add fixed state effects? If so, which set 
of regression results is more credible, and why? 


c. Do the results change when you add fixed time effects? If so, which set 
of regression results is more credible, and why? 


d. Repeat the analysis using ln(rob) and ln(mur) in place of ln(vio). 


e. In your view, what are the most important remaining threats to the 
internal validity of this regression analysis? 


f. Based on your analysis, what conclusions would you draw about the 
effects of concealed weapons laws on these crime rates? 


³ These data were provided by Professor John Donohue of Stanford University and were used in his paper 
with Ian Ayres, “Shooting Down the ‘More Guns Less Crime’ Hypothesis,” Stanford Law Review, 2003, 
55: 1193-1312. 


E10.2 Do citizens demand more democracy and political freedom as their incomes 
grow? That is, is democracy a normal good? On the text website, 
http://www.pearsonglobaleditions.com, you will find the data file Income_Democracy, 
which contains a panel data set from 195 countries for the years 1960, 1965,..., 
2000. A detailed description is given in Income_Democracy_Description, 
available on the website.⁴ The data set contains an index of political freedom/ 
democracy for each country in each year, together with data on each country’s 
income and various demographic controls. (The income and demographic con- 
trols are lagged five years relative to the democracy index to allow time for 
democracy to adjust to changes in these variables.) 


a. Is the data set a balanced panel? Explain. 
b. The index of political freedom/democracy is labeled Dem_ind. 


i. What are the minimum and maximum values of Dem_ind in the data 
set? What are the mean and standard deviation of Dem_ind in the 
data set? What are the 10th, 25th, 50th, 75th, and 90th percentiles of 
its distribution? 

ii. What is the value of Dem_ind for the United States in 2000? 
Averaged over all years in the data set? 


iii. What is the value of Dem_ind for Libya in 2000? Averaged over all 
years in the data set? 


iv. List five countries with an average value of Dem_ind greater than 
0.95; less than 0.10; and between 0.3 and 0.7. 


c. The logarithm of per capita income is labeled Log_GDPPC. Regress 
Dem_ind on Log_GDPPC. Use standard errors that are clustered by 
country. 


i. How large is the estimated coefficient on Log_GDPPC? Is the 
coefficient statistically significant? 


ii. If per capita income in a country increases by 20%, by how much is 
Dem_ind predicted to increase? What is a 95% confidence interval 
for the prediction? Is the predicted increase in Dem_ind large or 
small? (Explain what you mean by large or small.) 


iii. Why is it important to use clustered standard errors for the 
regression? Do the results change if you do not use clustered standard 
errors? 


⁴ These data were provided by Daron Acemoglu of M.I.T. and were used in his paper with Simon Johnson, 
James Robinson, and Pierre Yared, “Income and Democracy,” American Economic Review, 2008, 98:3, 
808-842. 


d. i. Suggest a variable that varies across countries but plausibly varies 
little, or not at all, over time and that could cause omitted variable 
bias in the regression in (c). 

ii. Estimate the regression in (c), allowing for country fixed effects. 
How do your answers to (c)(i) and (c)(ii) change? 

iii. Exclude the data for Azerbaijan, and rerun the regression. Do the 
results change? Why or why not? 

iv. Suggest a variable that varies over time but plausibly varies little, or 
not at all, across countries and that could cause omitted variable 
bias in the regression in (c). 

v. Estimate the regression in (c), allowing for time and country fixed 
effects. How do your answers to (c)(i) and (c)(ii) change? 


vi. There are additional demographic controls in the data set. Should 
these variables be included in the regression? If so, how do the results 
change when they are included? 

e. Based on your analysis, what conclusions do you draw about the effects 
of income on democracy? 


The State Traffic Fatality Data Set 


The data are for the contiguous 48 U.S. states (excluding Alaska and Hawaii), annually for 
1982 through 1988. The traffic fatality rate is the number of traffic deaths in a given state in a 
given year per 10,000 people living in that state in that year. Traffic fatality data were obtained 
from the U.S. Department of Transportation Fatal Accident Reporting System. The beer tax 
(the tax on a case of beer) was obtained from Beer Institute’s Brewers Almanac. The drinking 
age variables in Table 10.1 are binary variables indicating whether the legal drinking age is 
18, 19, or 20. The binary punishment variable in Table 10.1 describes the state’s minimum 
sentencing requirements for an initial drunk driving conviction: This variable equals 1 if the 
state requires jail time or community service and equals 0 otherwise (a lesser punishment). 
Data on the total vehicle miles traveled annually by state were obtained from the Depart- 
ment of Transportation. Personal income data were obtained from the U.S. Bureau of Eco- 
nomic Analysis, and the unemployment rate was obtained from the U.S. Bureau of Labor 
Statistics. 

These data were graciously provided by Professor Christopher J. Ruhm of the 
Department of Economics at the University of North Carolina. 


APPENDIX 10.2 Standard Errors for Fixed Effects Regression 


This appendix provides formulas for clustered standard errors for fixed effects regression with 
a single regressor. These formulas are extended to multiple regressors in Exercise 19.15. 


The Asymptotic Distribution of the Fixed Effects 
Estimator with Large n 


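The fixed effects estimator. The fixed effects estimator of $\beta_1$ is the OLS estimator obtained 
using the entity-demeaned regression of Equation (10.14), in which $\tilde{Y}_{it}$ is regressed on $\tilde{X}_{it}$, 
where $\tilde{Y}_{it} = Y_{it} - \bar{Y}_i$, $\tilde{X}_{it} = X_{it} - \bar{X}_i$, $\bar{Y}_i = T^{-1}\sum_{t=1}^{T} Y_{it}$, and $\bar{X}_i = T^{-1}\sum_{t=1}^{T} X_{it}$. The formula 
for the OLS estimator is obtained by replacing $X_i - \bar{X}$ by $\tilde{X}_{it}$ and $Y_i - \bar{Y}$ by $\tilde{Y}_{it}$ in Equation 
(4.5) and by replacing the single summations in Equation (4.5) by two summations, one over 
entities ($i = 1, \ldots, n$) and one over time periods ($t = 1, \ldots, T$),⁵ so 

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} \tilde{X}_{it}\tilde{Y}_{it}}{\sum_{i=1}^{n}\sum_{t=1}^{T} \tilde{X}_{it}^2}. \qquad (10.22)$$

The derivation of the sampling distribution of $\hat{\beta}_1$ parallels the derivation in Appendix 4.3 of the 
sampling distribution of the OLS estimator with cross-sectional data. First, substitute 
$\tilde{Y}_{it} = \beta_1\tilde{X}_{it} + \tilde{u}_{it}$ [Equation (10.14)] into the numerator of Equation (10.22) to obtain the panel 
data counterpart of Equation (4.28): 

$$\hat{\beta}_1 = \beta_1 + \frac{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T} \tilde{X}_{it}\tilde{u}_{it}}{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T} \tilde{X}_{it}^2}. \qquad (10.23)$$

Next, rearrange this expression and multiply both sides by $\sqrt{nT}$ to obtain 

$$\sqrt{nT}(\hat{\beta}_1 - \beta_1) = \frac{\sqrt{\frac{1}{n}}\sum_{i=1}^{n}\eta_i}{\hat{Q}_{\tilde{X}}}, \text{ where } \eta_i = \sqrt{\frac{1}{T}}\sum_{t=1}^{T}\tilde{X}_{it}\tilde{u}_{it} \text{ and } \hat{Q}_{\tilde{X}} = \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\tilde{X}_{it}^2. \qquad (10.24)$$

The scaling factor in Equation (10.24), $nT$, is the total number of observations. 

⁵ The double summation is the extension to double subscripts of a single summation: 
$\sum_{i=1}^{n}\sum_{t=1}^{T} X_{it} = \sum_{i=1}^{n}\left(\sum_{t=1}^{T} X_{it}\right)$. 


Distribution and standard errors when n is large. In most panel data applications, $n$ is 
much larger than $T$, which motivates approximating sampling distributions by letting $n \to \infty$ 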
while keeping $T$ fixed. Under the fixed effects regression assumptions of Key Concept 10.3, 
$\hat{Q}_{\tilde{X}} \xrightarrow{p} Q_{\tilde{X}} = E\left(T^{-1}\sum_{t=1}^{T}\tilde{X}_{it}^2\right)$ as $n \to \infty$. Also, $\eta_i$ is i.i.d. over $i = 1, \ldots, n$ (by assumption 
2) with mean 0 (by assumption 1) and variance $\sigma_\eta^2$ (which is finite by assumption 3), so by the 
central limit theorem, $\sqrt{1/n}\sum_{i=1}^{n}\eta_i \xrightarrow{d} N(0, \sigma_\eta^2)$. It follows from Equation (10.24) that 

$$\sqrt{nT}(\hat{\beta}_1 - \beta_1) \xrightarrow{d} N\left(0, \frac{\sigma_\eta^2}{Q_{\tilde{X}}^2}\right). \qquad (10.25)$$

From Equation (10.25), the variance of the large-sample distribution of $\hat{\beta}_1$ is 

$$\mathrm{var}(\hat{\beta}_1) = \frac{1}{nT}\,\frac{\sigma_\eta^2}{Q_{\tilde{X}}^2}. \qquad (10.26)$$

The clustered standard error formula replaces the population moments in Equation (10.26) 
by their sample counterparts: 

$$SE(\hat{\beta}_1) = \sqrt{\frac{1}{nT}\,\frac{s_{\hat{\eta}}^2}{\hat{Q}_{\tilde{X}}^2}}, \text{ where } s_{\hat{\eta}}^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\hat{\eta}_i - \bar{\hat{\eta}}\right)^2 = \frac{1}{n-1}\sum_{i=1}^{n}\hat{\eta}_i^2, \qquad (10.27)$$

where $\hat{\eta}_i = \sqrt{1/T}\sum_{t=1}^{T}\tilde{X}_{it}\hat{u}_{it}$ is the sample counterpart of $\eta_i$ [$\hat{\eta}_i$ is $\eta_i$ in Equation (10.24), with 
$\tilde{u}_{it}$ replaced by the fixed effects regression residual $\hat{u}_{it}$] and $\bar{\hat{\eta}} = (1/n)\sum_{i=1}^{n}\hat{\eta}_i$. The final 
equality in Equation (10.27) arises because $\bar{\hat{\eta}} = 0$, which in turn follows from the residuals and 
regressors being uncorrelated [Equation (4.32)]. Note that $s_{\hat{\eta}}^2$ is just the sample variance of $\hat{\eta}_i$ 
[see Equation (3.7)]. 

The estimator $s_{\hat{\eta}}^2$ is a consistent estimator of $\sigma_\eta^2$ as $n \to \infty$, even if there is heteroskedasticity or 
autocorrelation (Exercise 18.15); thus the clustered standard error in Equation (10.27) is 
heteroskedasticity- and autocorrelation-robust. Because the clustered standard error is consistent, the 
t-statistic testing $\beta_1 = \beta_{1,0}$ has a standard normal distribution under the null hypothesis as $n \to \infty$. 

All the foregoing results apply if there are multiple regressors. In addition, if $n$ is large, 
then the F-statistic testing $q$ restrictions (computed using the clustered variance formula) has 
its usual asymptotic $F_{q,\infty}$ distribution. 
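
Equations (10.22) and (10.27) translate directly into a few lines of code. The sketch below is illustrative only: it assumes a balanced panel with a single regressor, the function name `fe_clustered` is hypothetical, and the data are simulated with serially correlated errors rather than drawn from any data set in the text.

```python
import math
import random

def fe_clustered(x, y):
    """Return (beta1_hat, clustered_se) for a balanced panel, one regressor.

    x and y hold one inner list of T observations per entity.
    """
    n, T = len(x), len(x[0])
    # Entity-demean X and Y.
    x_til = [[v - sum(xi) / T for v in xi] for xi in x]
    y_til = [[v - sum(yi) / T for v in yi] for yi in y]
    sxx = sum(v * v for xi in x_til for v in xi)
    sxy = sum(a * b for xi, yi in zip(x_til, y_til) for a, b in zip(xi, yi))
    beta1_hat = sxy / sxx                      # Equation (10.22)
    q_hat = sxx / (n * T)                      # Q-hat in Equation (10.24)
    # eta_hat_i = sqrt(1/T) * sum_t x_til_it * u_hat_it, one value per entity.
    eta_hat = [
        sum(a * (b - beta1_hat * a) for a, b in zip(xi, yi)) / math.sqrt(T)
        for xi, yi in zip(x_til, y_til)
    ]
    # Sample variance of eta_hat; its mean is zero by the OLS first-order condition.
    s2_eta = sum(e * e for e in eta_hat) / (n - 1)
    se = math.sqrt(s2_eta / (n * T)) / q_hat   # Equation (10.27)
    return beta1_hat, se

# Simulated panel (hypothetical values) with AR(1)-style errors within entity.
rng = random.Random(1)
n, T, beta1 = 300, 4, 1.5
x, y = [], []
for _ in range(n):
    alpha_i = rng.gauss(0, 1)
    u = rng.gauss(0, 1)
    xi, yi = [], []
    for t in range(T):
        u = 0.6 * u + rng.gauss(0, 1)          # serially correlated error
        xit = alpha_i + rng.gauss(0, 1)
        xi.append(xit)
        yi.append(beta1 * xit + alpha_i + u)
    x.append(xi)
    y.append(yi)

beta1_hat, se = fe_clustered(x, y)
print(f"beta1_hat = {beta1_hat:.3f}, clustered SE = {se:.3f}")
```

The standard error is built from one summed term per entity, which is what lets it absorb any pattern of correlation among the errors within an entity.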


Why isn’t the usual heteroskedasticity-robust estimator of Chapter 5 valid for 
panel data? There are two reasons. The most important reason is that the heteroskedasticity- 
robust estimator of Chapter 5 does not allow for serial correlation within a cluster. Recall that, 
for two random variables U and V, var(U + V) = var(U) + var(V) + 2cov(U, V). The 
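variance of $\eta_i$ in Equation (10.24) therefore can be written as the sum of variances plus 
covariances. Let $\tilde{v}_{it} = \tilde{X}_{it}\tilde{u}_{it}$; then 

$$\begin{aligned}
\mathrm{var}(\eta_i) &= \mathrm{var}\left(\sqrt{\tfrac{1}{T}}\sum_{t=1}^{T}\tilde{v}_{it}\right) = \frac{1}{T}\,\mathrm{var}(\tilde{v}_{i1} + \tilde{v}_{i2} + \cdots + \tilde{v}_{iT}) \\
&= \frac{1}{T}\left[\mathrm{var}(\tilde{v}_{i1}) + \mathrm{var}(\tilde{v}_{i2}) + \cdots + \mathrm{var}(\tilde{v}_{iT}) + 2\,\mathrm{cov}(\tilde{v}_{i1}, \tilde{v}_{i2}) + \cdots + 2\,\mathrm{cov}(\tilde{v}_{iT-1}, \tilde{v}_{iT})\right]. \qquad (10.28)
\end{aligned}$$

The heteroskedasticity-robust variance formula of Chapter 5 misses all the covariances in the 
final part of Equation (10.28), so if there is serial correlation, the usual heteroskedasticity- 
robust variance estimator is inconsistent. 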
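
To get a feel for how much the omitted covariances can matter, the sketch below evaluates Equation (10.28) under a simplifying assumption that is not made in the text: every $\tilde{v}_{it}$ has the same variance `sigma2`, and the correlation between $\tilde{v}_{it}$ and $\tilde{v}_{is}$ is `rho ** abs(t - s)`. The function name `var_eta_components` is hypothetical.

```python
def var_eta_components(T, sigma2, rho):
    """Evaluate var(eta_i) from Equation (10.28) under an assumed
    correlation pattern corr(v_it, v_is) = rho ** |t - s|.

    Returns (full variance, variance that ignores the covariances).
    """
    variances = T * sigma2
    covariances = sum(
        2 * sigma2 * rho ** (s - t)
        for t in range(T)
        for s in range(t + 1, T)
    )
    return (variances + covariances) / T, variances / T

full, variances_only = var_eta_components(T=7, sigma2=1.0, rho=0.5)
print(f"with covariances: {full:.3f}, variances only: {variances_only:.3f}")
```

With `rho = 0` the two numbers coincide; with positive serial correlation the variances-only figure understates the true variance, which is why standard errors that ignore the covariances tend to be too small in that case.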

The second reason is that if T is small, the estimation of the fixed effects introduces bias 
into the heteroskedasticity-robust variance estimator. This problem does not arise in cross- 
sectional regression. 

The one case in which the usual heteroskedasticity-robust standard errors can be used 
with panel data is with fixed effects regression with T = 2 observations. In this case, fixed 
effects regression is equivalent to the differences regression in Section 10.2, and 
heteroskedasticity-robust and clustered standard errors are equivalent. 
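
The equivalence of the coefficient estimates when T = 2 is easy to check numerically. The sketch below compares the entity-demeaned slope with the before-and-after slope estimated without an intercept; the data are simulated and the function names are hypothetical. It checks only the slope, not the standard errors.

```python
import random

def demeaned_slope(x, y):
    """Entity-demeaned (fixed effects) slope for a balanced panel."""
    num = den = 0.0
    for xi, yi in zip(x, y):
        x_bar = sum(xi) / len(xi)
        y_bar = sum(yi) / len(yi)
        num += sum((a - x_bar) * (b - y_bar) for a, b in zip(xi, yi))
        den += sum((a - x_bar) ** 2 for a in xi)
    return num / den

def before_after_slope(x, y):
    """No-intercept regression of (Y_i2 - Y_i1) on (X_i2 - X_i1)."""
    num = sum((xi[1] - xi[0]) * (yi[1] - yi[0]) for xi, yi in zip(x, y))
    den = sum((xi[1] - xi[0]) ** 2 for xi in x)
    return num / den

# Simulated two-period panel (hypothetical values).
rng = random.Random(2)
x, y = [], []
for _ in range(50):
    alpha_i = rng.gauss(0, 1)
    xi = [alpha_i + rng.gauss(0, 1) for _ in range(2)]
    yi = [0.8 * v + alpha_i + rng.gauss(0, 1) for v in xi]
    x.append(xi)
    y.append(yi)

dm = demeaned_slope(x, y)
ba = before_after_slope(x, y)
print(f"demeaned = {dm:.6f}, before-and-after = {ba:.6f}")
```

The two slopes agree up to floating-point rounding because, with two periods, each demeaned value is plus or minus half of the entity's change between the periods.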

For empirical examples showing the importance of using clustered standard errors in 
economic panel data, see Bertrand, Duflo, and Mullainathan (2004). 


Extensions: Other applications of clustered standard errors. In some cases, $u_{it}$ might 
be correlated across entities. For example, in a study of earnings, suppose the sampling scheme 
selects families by simple random sampling, then tracks all siblings within a family. Because 
the omitted factors that enter the error term could have common elements for siblings, it is not 
reasonable to assume that the errors are independent for siblings (even though they are 
independent across families). 

In the siblings example, families are natural clusters, or groupings, of observations, where 
$u_{it}$ is correlated within the cluster but not across clusters. The derivation leading to Equation 
(10.27) can be modified to allow for clusters across entities (for example, families) or across 
both entities and time, as long as there are many clusters. 

Clustered standard errors also apply in some applications with cross-sectional data when 
collection schemes other than simple random sampling are used. For example, suppose cross- 
sectional student-level data on test scores and student characteristics are obtained by first 
randomly sampling classrooms, then collecting data on all students within a classroom. Because 
the classrooms are randomly sampled, errors would be uncorrelated for students from differ- 
ent classrooms. However, the errors might be correlated for students within the same class- 
room, so clustered standard errors would be appropriate, with the clustering done at the 
classroom level. 


For additional discussion of clustered standard errors, see Cameron and Miller (2015). 


Distribution and Standard Errors When n Is Small 


If $n$ is small and $T$ is large, then it remains possible to use clustered standard errors; however, 
t-statistics need to be compared with critical values from the $t_{n-1}$ tables, and the F-statistic 
testing $q$ restrictions needs to be compared to the $F_{q,n-q}$ critical value multiplied by 
$(n-1)/(n-q)$. These distributions are valid under the assumptions in Key Concept 10.3, 
plus some additional assumptions on the joint distribution of $X_{it}$ and $u_{it}$ over time within an 
entity. Although the validity of the t-distribution in cross-sectional regression requires 
normality and homoskedasticity of the regression errors (Section 5.6), neither requirement is 
needed to justify using the t-distribution with clustered standard errors in panel data when $T$ 
is large. 

To see why the clustered t-statistic has a $t_{n-1}$ distribution when $n$ is small and $T$ is large, 
even if $u_{it}$ is neither normally distributed nor homoskedastic, first note that if $T$ is large, then 
under additional assumptions, $\eta_i$ in Equation (10.24) will obey a central limit theorem, so 
$\eta_i \xrightarrow{d} N(0, \sigma_\eta^2)$. (The additional assumptions required for this result are substantial and 
technical, and we defer further discussion of them to our treatment of time series data in 
Chapter 15.) Thus, if $T$ is large, then $\sqrt{nT}(\hat{\beta}_1 - \beta_1)$ in Equation (10.24) is a scaled average of 
the $n$ normal random variables $\eta_i$. Moreover, the clustered formula $s_{\hat{\eta}}^2$ in Equation (10.27) is 
the usual formula for the sample variance, and if it could be computed using $\eta_i$, then 
$(n-1)s_\eta^2/\sigma_\eta^2$ would have a $\chi^2_{n-1}$ distribution, so the t-statistic would have a $t_{n-1}$ distribution 
[see Section 3.6]. Using the residuals to compute $\hat{\eta}_i$ and $s_{\hat{\eta}}^2$ does not change this conclusion. 
In the case of multiple regressors, analogous reasoning leads to the conclusion that the 
F-statistic testing $q$ restrictions, computed using the cluster variance estimator, is distributed 
as $\frac{n-1}{n-q}F_{q,n-q}$.⁶ [For example, the 5% critical value for this F-statistic when $n = 10$ and 
$q = 4$ is $\frac{9}{6} \times 4.53 = 6.80$, where 4.53 is the 5% critical value from the $F_{4,6}$ distribution 
given in Appendix Table 5B.] Note that, as $n$ increases, the $t_{n-1}$ and $\frac{n-1}{n-q}F_{q,n-q}$ 
distributions approach the usual standard normal and $F_{q,\infty}$ distributions. 

If both $n$ and $T$ are small, then, in general, $\hat{\beta}_1$ will not be normally distributed, and 
clustered standard errors will not provide reliable inference. 
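
The small-n adjustment to the F critical value is simple arithmetic, sketched below. The helper name `small_n_f_critical` is hypothetical, and the tabulated value 4.53 is taken as an input from the text rather than computed.

```python
def small_n_f_critical(f_table_value, n, q):
    """Scale an F(q, n - q) critical value by (n - 1)/(n - q).

    f_table_value is the tabulated critical value of the F(q, n - q)
    distribution at the chosen significance level.
    """
    if n <= q:
        raise ValueError("the adjustment requires n > q")
    return (n - 1) / (n - q) * f_table_value

# n = 10 clusters, q = 4 restrictions, 5% critical value of F(4, 6) = 4.53.
crit = small_n_f_critical(4.53, n=10, q=4)
print(f"adjusted 5% critical value: {crit:.3f}")
```

The adjustment factor exceeds 1 whenever q is greater than 1, so the hurdle for rejecting is higher than the unadjusted table value suggests, and the gap shrinks as n grows.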


⁶ Not all software implements clustered standard errors using the $t_{n-1}$ and $\frac{n-1}{n-q}F_{q,n-q}$ distributions that 
apply if $n$ is small, so you should check how your software implements and treats clustered standard errors. 


CHAPTER 11 Regression with a Binary Dependent Variable 


Two people, identical but for their race, walk into a bank and apply for a mortgage, 
a large loan so that each can buy an identical house. Does the bank treat them the 
same way? Are they both equally likely to have their mortgage application accepted? 
By law, they must receive identical treatment. But whether they actually do is a matter 
of great concern among bank regulators. 

Loans are made and denied for many legitimate reasons. For example, if the 
proposed loan payments take up most or all of the applicant’s monthly income, a loan 
officer might justifiably deny the loan. Also, even loan officers are human and they can 
make honest mistakes, so the denial of a single minority applicant does not prove 
anything about discrimination. Many studies of discrimination thus look for statistical 
evidence of discrimination, that is, evidence contained in large data sets showing that 
whites and minorities are treated differently. 

But how, precisely, should one check for statistical evidence of discrimination 
in the mortgage market? A start is to compare the fraction of minority and white 
applicants who were denied a mortgage. In the data examined in this chapter, 
gathered from mortgage applications in 1990 in the Boston, Massachusetts, area, 28% 
of black applicants were denied mortgages but only 9% of white applicants were 
denied. But this comparison does not really answer the question that opened this 
chapter because the black applicants and the white applicants were not necessarily 
“identical but for their race.” Instead, we need a method for comparing rates of denial, 
holding other applicant characteristics constant. 

This sounds like a job for multiple regression analysis—and it is, but with a twist. 
The twist is that the dependent variable—whether the applicant is denied—is binary. 
In Part II, we regularly used binary variables as regressors, and they caused no 
particular problems. But when the dependent variable is binary, things are more 
difficult: What does it mean to fit a line to a dependent variable that can take on 
only two values, 0 and 1? 

The answer to this question is to interpret the regression function as a conditional 
probability. This interpretation is discussed in Section 11.1, and it allows us to apply 
the multiple regression models from Part II to binary dependent variables. Section 11.1 
goes over this “linear probability model.” But the predicted probability interpretation 
also suggests that alternative, nonlinear regression models can do a better job 
modeling these probabilities. These methods, called “probit” and “logit” regression, are 
discussed in Section 11.2. Section 11.3, which is optional, discusses the method used 
to estimate the coefficients of the probit and logit regressions, the method of 
maximum likelihood estimation. In Section 11.4, we apply these methods to the 
Boston mortgage application data set to see whether there is evidence of racial bias 
in mortgage lending. 

The binary dependent variable considered in this chapter is an example of a 
dependent variable with a limited range; in other words, it is a limited dependent 
variable. Models for other types of limited dependent variables (for example, 
dependent variables that take on multiple discrete values) are surveyed in 
Appendix 11.3. 


11.1 Binary Dependent Variables 
and the Linear Probability Model 


Whether a mortgage application is accepted or denied is one example of a binary 
variable. Many other important questions also concern binary outcomes. What is the 
effect of a tuition subsidy on an individual’s decision to go to college? What deter- 
mines whether a teenager takes up smoking? What determines whether a country 
receives foreign aid? What determines whether a job applicant is successful? In all 
these examples, the outcome of interest is binary: The student does or does not go to 
college, the teenager does or does not take up smoking, a country does or does not 
receive foreign aid, the applicant does or does not get a job. 

This section discusses what distinguishes regression with a binary dependent 
variable from regression with a continuous dependent variable and then turns to the 
simplest model to use with binary dependent variables, the linear probability model. 


Binary Dependent Variables 


The application examined in this chapter is whether race is a factor in denying a 
mortgage application; the binary dependent variable is whether a mortgage applica- 
tion is denied. The data are a subset of a larger data set compiled by researchers at 
the Federal Reserve Bank of Boston under the Home Mortgage Disclosure Act 
(HMDA) and relate to mortgage applications filed in the Boston, Massachusetts, 
area in 1990. The Boston HMDA data are described in Appendix 11.1. 

Mortgage applications are complicated. During the period covered by these data, 
the decision to approve a loan application typically was made by a bank loan officer. 
The loan officer must assess whether the applicant will make his or her loan pay- 
ments. One important piece of information is the size of the required loan payments 
relative to the applicant’s income. As anyone who has borrowed money knows, it is 
much easier to make payments that are 10% of your income than 50%! We therefore 
begin by looking at the relationship between two variables: the binary dependent 
variable deny, which equals 1 if the mortgage application was denied and equals 0 if 
it was accepted, and the continuous variable P/I ratio, which is the ratio of the 
applicant’s anticipated total monthly loan payments to his or her monthly income. 

FIGURE 11.1 Scatterplot of Mortgage Application Denial and the Payment-to-Income Ratio 

Figure 11.1 presents a scatterplot of deny versus P/I ratio for 127 of the 2380 
observations in the data set. (The scatterplot is easier to read using this subset of the 
data.) This scatterplot looks different from the scatterplots of Part II because the 
variable deny is binary. Still, it seems to show a relationship between deny and P/I 
ratio: Few applicants with a payment-to-income ratio less than 0.3 have their 
application denied, but most applicants with a payment-to-income ratio exceeding 
0.4 are denied. 

This positive relationship between P/I ratio and deny (the higher the P/I ratio, 
the greater the fraction of denials) is summarized in Figure 11.1 by the OLS regres- 
sion line estimated using these 127 observations. As usual, this line plots the pre- 
dicted value of deny as a function of the regressor, the payment-to-income ratio. For 
example, when P/I ratio = 0.3, the predicted value of deny is 0.20. But what, pre- 
cisely, does it mean for the predicted value of the binary variable deny to be 0.20? 

The key to answering this question, and more generally to understanding
regression with a binary dependent variable, is to interpret the regression as
modeling the probability that the dependent variable equals 1. Thus the predicted
value of 0.20 is interpreted as meaning that, when P/I ratio is 0.3, the probability
of denial is estimated to be 20%. Said differently, if there were many applications
with P/I ratio = 0.3, then 20% of them would be denied.

This interpretation follows from two facts. First, from Part II, the population
regression function is the expected value of Y given the regressors,
$E(Y | X_1, \ldots, X_k)$. Second, from Section 2.2, if Y is a 0-1 binary variable,
its expected value (or mean) is the probability that Y = 1; that is,
$E(Y) = 0 \times \Pr(Y = 0) + 1 \times \Pr(Y = 1) = \Pr(Y = 1)$.
In the regression context, the expected value is conditional on the value of the
regressors, so the probability is conditional on X. Thus for a binary variable,
$E(Y | X_1, \ldots, X_k) = \Pr(Y = 1 | X_1, \ldots, X_k)$. In short, for a binary
dependent variable, the predicted value from the population regression is the
probability that Y = 1 given X.

The linear multiple regression model applied to a binary dependent variable is 
called the linear probability model: linear because it is a straight line and probability 
model because it models the probability that the dependent variable equals 1 (in our 
example, the probability of loan denial). 


The Linear Probability Model 


The linear probability model is the name for the multiple regression model of Part II
when the dependent variable is binary rather than continuous. Because the dependent
variable Y is binary, the population regression function corresponds to the
probability that the dependent variable equals 1 given X. The population coefficient
$\beta_1$ on a regressor X is the change in the probability that Y = 1 associated with
a unit change in X. Similarly, the OLS predicted value, $\hat{Y}_i$, computed using the
estimated regression function, is the predicted probability that the dependent
variable equals 1, and the OLS estimator $\hat{\beta}_1$ estimates the change in the
probability that Y = 1 associated with a unit change in X.

Almost all of the tools of Part II carry over to the linear probability model. The 
coefficients can be estimated by OLS. Ninety-five percent confidence intervals can 
be formed as ± 1.96 standard errors, hypotheses concerning several coefficients can
be tested using the F-statistic discussed in Chapter 7, and interactions between vari- 
ables can be modeled using the methods of Section 8.3. Because the errors of the 
linear probability model are always heteroskedastic (Exercise 11.8), it is essential that 
heteroskedasticity-robust standard errors be used for inference. 

One tool that does not carry over is the $R^2$. When the dependent variable is
continuous, it is possible to imagine a situation in which the $R^2$ equals 1: All the
data lie exactly on the regression line. This is impossible when the dependent
variable is binary unless the regressors are also binary. Accordingly, the $R^2$ is not
a particularly useful statistic here. We return to measures of fit in the next section.

The linear probability model is summarized in Key Concept 11.1. 


Application to the Boston HMDA data. The OLS regression of the binary depen- 
dent variable, deny, against the payment-to-income ratio, P/I ratio, estimated using 
all 2380 observations in our data set is 


$$\widehat{\text{deny}} = \underset{(0.032)}{-0.080} + \underset{(0.098)}{0.604}\,\text{P/I ratio}. \tag{11.1}$$


The estimated coefficient on P/I ratio is positive, and the population coefficient is 
statistically significantly different from 0 at the 1% level (the t-statistic is 6.13). Thus 
applicants with higher debt payments as a fraction of income are more likely to have 
their application denied. This coefficient can be used to compute the predicted
change in the probability of denial given a change in the regressor. For example,
according to Equation (11.1), if P/I ratio increases by 0.1, the probability of denial
increases by 0.604 × 0.1 = 0.060; that is, by 6.0 percentage points.

KEY CONCEPT 11.1 The Linear Probability Model

The linear probability model is the linear multiple regression model,

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i, \tag{11.2}$$

applied to a binary dependent variable $Y_i$. Because Y is binary,
$E(Y | X_1, X_2, \ldots, X_k) = \Pr(Y = 1 | X_1, X_2, \ldots, X_k)$, so for the linear
probability model,

$$\Pr(Y = 1 | X_1, X_2, \ldots, X_k) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k.$$

The regression coefficient $\beta_1$ is the difference in the probability that Y = 1
associated with a unit difference in $X_1$, holding constant the other regressors, and
so forth for $\beta_2, \ldots, \beta_k$. The regression coefficients can be estimated by
OLS, and the usual (heteroskedasticity-robust) OLS standard errors can be used for
confidence intervals and hypothesis tests.

The estimated linear probability model in Equation (11.1) can be used to compute
predicted denial probabilities as a function of P/I ratio. For example, if projected
debt payments are 30% of an applicant’s income, P/I ratio is 0.3, and the predicted
value from Equation (11.1) is -0.080 + 0.604 × 0.3 = 0.101. That is, according to
this linear probability model, an applicant whose projected debt payments are 30%
of income has a probability of 10.1% that his or her application will be denied. [This
is different from the probability of 20% based on the regression line in Figure 11.1
because that line was estimated using only 127 of the 2380 observations used to
estimate Equation (11.1).]
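
The two calculations above can be reproduced in a few lines. The following is a minimal sketch that hard-codes the rounded coefficients reported in Equation (11.1); the function name `lpm_predicted_probability` is an illustrative choice, not something defined in the text.

```python
# Rounded OLS estimates reported in Equation (11.1).
INTERCEPT = -0.080
SLOPE = 0.604


def lpm_predicted_probability(pi_ratio: float) -> float:
    """Predicted probability of denial from the linear probability model."""
    return INTERCEPT + SLOPE * pi_ratio


# Predicted denial probability when payments are 30% of income.
p_at_03 = lpm_predicted_probability(0.3)

# Change in the predicted probability when P/I ratio rises by 0.1.
change = lpm_predicted_probability(0.4) - lpm_predicted_probability(0.3)

print(f"Predicted probability at P/I ratio = 0.3: {p_at_03:.3f}")  # 0.101
print(f"Change for a 0.1 increase in P/I ratio:   {change:.3f}")   # 0.060
```

Because the model is linear, the 0.060 change is the same whatever starting value of P/I ratio is used.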

What is the effect of race on the probability of denial, holding constant the P/I 
ratio? To keep things simple, we focus on differences between black applicants and 
white applicants. To estimate the effect of race, holding constant P/I ratio, we aug- 
ment Equation (11.1) with a binary regressor that equals 1 if the applicant is black 
and equals 0 if the applicant is white. The estimated linear probability model is 


$$\widehat{\text{deny}} = \underset{(0.029)}{-0.091} + \underset{(0.089)}{0.559}\,\text{P/I ratio} + \underset{(0.025)}{0.177}\,\text{black}. \tag{11.3}$$


The coefficient on black, 0.177, indicates that an African American applicant has a
17.7 percentage point higher probability of having a mortgage application denied than
a white applicant, holding constant their payment-to-income ratio. This coefficient is
significant at the 1% level (the t-statistic is 7.11).


Taken literally, this estimate suggests that there might be racial bias in mortgage
decisions, but such a conclusion would be premature. Although the payment-to-income
ratio plays a role in the loan officer’s decision, so do many other factors, such
as the applicant’s earning potential and his or her credit history. If any of these
variables is correlated with the regressor black given the P/I ratio, its omission from
Equation (11.3) will cause omitted variable bias. Thus we must defer any conclusions
about discrimination in mortgage lending until we complete the more thorough
analysis in Section 11.3.


Shortcomings of the linear probability model. The linearity that makes the linear 
probability model easy to use is also its major flaw. Because probabilities cannot 
exceed 1, the effect on the probability that Y = 1 of a given change in X must be 
nonlinear: Although a change in P/I ratio from 0.3 to 0.4 might have a large effect on 
the probability of denial, once P/I ratio is so large that the loan is very likely to be 
denied, increasing P/I ratio further will have little effect. In contrast, in the linear prob- 
ability model, the effect of a given change in P/I ratio is constant, which leads to pre- 
dicted probabilities in Figure 11.1 that drop below 0 for very low values of P/I ratio 
and exceed 1 for high values! But this is nonsense: A probability cannot be less than 
0 or greater than 1. This nonsensical feature is an inevitable consequence of the linear 
regression. To address this problem, we introduce new nonlinear models specifically 
designed for binary dependent variables, the probit and logit regression models. 
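
To make the out-of-range problem concrete, here is a small sketch. It uses the full-sample line in Equation (11.1) rather than the 127-observation line plotted in Figure 11.1, and the three P/I ratio values are hypothetical inputs chosen only for illustration.

```python
def lpm_predicted_probability(pi_ratio: float) -> float:
    """Linear probability model using the estimates in Equation (11.1)."""
    return -0.080 + 0.604 * pi_ratio


# Values of P/I ratio at which the fitted line crosses 0 and 1.
lower_crossing = 0.080 / 0.604         # predictions are negative below this
upper_crossing = (1.0 + 0.080) / 0.604  # predictions exceed 1 above this

# Hypothetical P/I ratios: one very low, one typical, one very high.
predictions = {r: lpm_predicted_probability(r) for r in (0.1, 0.3, 2.0)}

for pi_ratio, p in predictions.items():
    status = "valid probability" if 0.0 <= p <= 1.0 else "outside [0, 1]"
    print(f"P/I ratio = {pi_ratio:.1f}: predicted value = {p:.3f} ({status})")

print(f"Line crosses 0 at P/I ratio = {lower_crossing:.3f}")
print(f"Line crosses 1 at P/I ratio = {upper_crossing:.3f}")
```

A straight line with a nonzero slope must eventually leave the interval from 0 to 1, which is why the probit and logit models replace it with an S-shaped function.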


Probit and Logit Regression 


Probit and logit¹ regression are nonlinear regression models specifically designed for
binary dependent variables. Because a regression with a binary dependent variable
Y models the probability that Y = 1, it makes sense to adopt a nonlinear formulation
that forces the predicted values to be between 0 and 1. Because cumulative probabil- 
ity distribution functions (c.d.f.’s) produce probabilities between 0 and 1 (Section 2.1), 
they are used in logit and probit regressions. Probit regression uses the standard 
normal c.d.f. Logit regression, also called logistic regression, uses the logistic c.d.f. 


Probit Regression 


Probit regression with a single regressor. The probit regression model with a single 
regressor X is 


$$\Pr(Y = 1 | X) = \Phi(\beta_0 + \beta_1 X), \tag{11.4}$$

where $\Phi$ is the cumulative standard normal distribution function (tabulated in
Appendix Table 1).


¹Pronounced prō-bit and lō-jit.

For example, suppose that Y is the binary mortgage denial variable (deny), X is
the payment-to-income ratio (P/I ratio), $\beta_0 = -2$, and $\beta_1 = 3$. What then is
the probability of denial if P/I ratio = 0.4? According to Equation (11.4), this
probability is $\Phi(\beta_0 + \beta_1\,\text{P/I ratio}) = \Phi(-2 + 3\,\text{P/I ratio}) = \Phi(-2 + 3 \times 0.4) = \Phi(-0.8)$.
According to the cumulative normal distribution table (Appendix Table 1),
$\Phi(-0.8) = \Pr(Z \le -0.8) = 21.2\%$. That is, when P/I ratio is 0.4, the predicted
probability that the application will be denied is 21.2%, computed using the probit
model with the coefficients $\beta_0 = -2$ and $\beta_1 = 3$.

In the probit model, the term $\beta_0 + \beta_1 X$ plays the role of “z” in the cumulative
standard normal distribution table in Appendix Table 1. Thus the calculation in the
previous paragraph can, equivalently, be done by first computing the “z-value,”
$z = \beta_0 + \beta_1 X = -2 + 3 \times 0.4 = -0.8$, and then looking up the probability in the
tail of the normal distribution to the left of z = -0.8, which is 21.2%.
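
In software, the table lookup is replaced by a call to the standard normal c.d.f. The sketch below uses `statistics.NormalDist` from the Python standard library and the illustrative coefficients from the example (they are not estimates); the variable names are assumptions made for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist()

# Illustrative probit coefficients from the example in the text.
beta_0, beta_1 = -2.0, 3.0
pi_ratio = 0.4

# Step 1: compute the z-value.
z_value = beta_0 + beta_1 * pi_ratio

# Step 2: evaluate the standard normal c.d.f. at z, i.e. Pr(Z <= z).
denial_probability = standard_normal.cdf(z_value)

print(f"z = {z_value:.1f}")                         # z = -0.8
print(f"Pr(deny = 1) = {denial_probability:.3f}")  # Pr(deny = 1) = 0.212
```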

The probit coefficient $\beta_1$ in Equation (11.4) is the difference in the z-value
associated with a unit difference in X. If $\beta_1$ is positive, a greater value for X
increases the z-value and thus increases the probability that Y = 1; if $\beta_1$ is
negative, a greater value for X decreases the probability that Y = 1. Although the
effect of X on the z-value is linear, its effect on the probability is nonlinear. Thus
in practice the easiest way to interpret the coefficients of a probit model is to
compute the predicted probability, or the change in the predicted probability, for
one or more values of the regressors. When there is just one regressor, the predicted
probability can be plotted as a function of X.

Figure 11.2 plots the estimated regression function produced by the probit
regression of deny on P/I ratio for the 127 observations in the scatterplot. The
estimated probit regression function has a stretched “S” shape: It is nearly 0 and flat
for small values of P/I ratio, it turns and increases for intermediate values, and it
flattens out again and is nearly 1 for large values. For small values of the
payment-to-income ratio, the probability of denial is small. For example, for
P/I ratio = 0.2, the estimated probability of denial based on the estimated probit
function in Figure 11.2 is Pr(deny = 1|P/I ratio = 0.2) = 2.1%. When P/I ratio = 0.3,
the estimated probability of denial is 16.1%. When P/I ratio = 0.4, the probability of
denial increases sharply to 51.9%, and when P/I ratio = 0.6, the denial probability is
98.3%. According to this estimated probit model, for applicants with high
payment-to-income ratios, the probability of denial is nearly 1.


Probit regression with multiple regressors. In all the regression problems we have 
studied so far, leaving out a determinant of Y that is correlated with the included 
regressors results in omitted variable bias. Probit regression is no exception. In linear 
regression, the solution is to include the additional variable as a regressor. This is also 
the solution to omitted variable bias in probit regression. 

The probit model with multiple regressors extends the single-regressor probit
model by adding regressors to compute the z-value. Accordingly, the probit
population regression model with two regressors, $X_1$ and $X_2$, is

$$\Pr(Y = 1 | X_1, X_2) = \Phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2). \tag{11.5}$$

For example, suppose that $\beta_0 = -1.6$, $\beta_1 = 2$, and $\beta_2 = 0.5$. If $X_1 = 0.4$
and $X_2 = 1$, the z-value is $z = -1.6 + 2 \times 0.4 + 0.5 \times 1 = -0.3$. So the
probability that Y = 1 given $X_1 = 0.4$ and $X_2 = 1$ is
$\Pr(Y = 1 | X_1 = 0.4, X_2 = 1) = \Phi(-0.3) = 38\%$.
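
The same two-step calculation extends to any number of regressors. The helper below is a sketch under the assumption that coefficients and regressor values are passed as equal-length lists; the name `probit_probability` is hypothetical and is used only for this illustration.

```python
from statistics import NormalDist


def probit_probability(intercept, coefficients, regressors):
    """Return (z, Pr(Y = 1 | X)) for a probit model.

    z = intercept + sum_j coefficients[j] * regressors[j], and the
    probability is the standard normal c.d.f. evaluated at z.
    """
    if len(coefficients) != len(regressors):
        raise ValueError("need one coefficient per regressor")
    z = intercept + sum(b * x for b, x in zip(coefficients, regressors))
    return z, NormalDist().cdf(z)


# Illustrative values from the text: beta_0 = -1.6, beta_1 = 2, beta_2 = 0.5,
# evaluated at X_1 = 0.4 and X_2 = 1.
z, probability = probit_probability(-1.6, [2.0, 0.5], [0.4, 1.0])
print(f"z = {z:.1f}, Pr(Y = 1) = {probability:.2f}")  # z = -0.3, Pr(Y = 1) = 0.38
```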


Effect of a change in X. In general, the regression model can be used to determine 
the expected change in Y arising from a change in X. When Y is binary, its conditional 
expectation is the conditional probability that it equals 1, so the expected change in 
Y arising from a change in X is the change in the probability that Y = 1. 

Recall from Section 8.1 that, when the population regression function is a non- 
linear function of X, this expected change is estimated in three steps: First, com- 
pute the predicted value at the original value of X using the estimated regression 
function; next, compute the predicted value at the changed value of X, $X + \Delta X$;
finally, compute the difference between the two predicted values. This procedure 
is summarized in Key Concept 8.1. As emphasized in Section 8.1, this method 
always works for computing predicted effects of a change in X, no matter how 
complicated the nonlinear model. When applied to the probit model, the method 
of Key Concept 8.1 yields the estimated effect on the probability that Y = 1 of a 
change in X. 

The probit regression model, predicted probabilities, and estimated effects are
summarized in Key Concept 11.2.

KEY CONCEPT 11.2 The Probit Model, Predicted Probabilities, and Estimated Effects

The population probit model with multiple regressors is

$$\Pr(Y = 1 | X_1, X_2, \ldots, X_k) = \Phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k), \tag{11.6}$$

where the dependent variable Y is binary, $\Phi$ is the cumulative standard normal
distribution function, and $X_1$, $X_2$, and so on are regressors. The model is best
interpreted by computing predicted probabilities and the effect of a change in a
regressor.

The predicted probability that Y = 1, given values of $X_1, X_2, \ldots, X_k$, is
calculated by computing the z-value, $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$,
and then looking up this z-value in the normal distribution table (Appendix Table 1).

The coefficient $\beta_1$ is the difference in the z-value arising from a unit difference
in $X_1$, holding constant $X_2, \ldots, X_k$.

The effect on the predicted probability of a change in a regressor is computed
by (1) computing the predicted probability for the initial value of the regressor,
(2) computing the predicted probability for the new or changed value of the
regressor, and (3) taking their difference.


Application to the mortgage data. As an illustration, we fit a probit model to the 
2380 observations in our data set on mortgage denial (deny) and the payment-to- 
income ratio (P/I ratio): 


$$\widehat{\Pr}(\text{deny} = 1 | \text{P/I ratio}) = \Phi(\underset{(0.16)}{-2.19} + \underset{(0.47)}{2.97}\,\text{P/I ratio}). \tag{11.7}$$


The estimated coefficients of -2.19 and 2.97 are difficult to interpret because they
affect the probability of denial via the z-value. Indeed, the only things that can be
readily concluded from the estimated probit regression in Equation (11.7) are that
the payment-to-income ratio is positively related to probability of denial (the
coefficient on P/I ratio is positive) and that this relationship is statistically significant
(t = 2.97/0.47 = 6.32).

What is the change in the predicted probability that an application will be denied when
the payment-to-income ratio increases from 0.3 to 0.4? To answer this question, we follow
the procedure in Key Concept 8.1: Compute the probability of denial for P/I ratio = 0.3
and for P/I ratio = 0.4, and then compute the difference. The probability of denial when
P/I ratio = 0.3 is $\Phi(-2.19 + 2.97 \times 0.3) = \Phi(-1.30) = 0.097$. The probability of
denial when P/I ratio = 0.4 is $\Phi(-2.19 + 2.97 \times 0.4) = \Phi(-1.00) = 0.159$. The
estimated change in the probability of denial is 0.159 - 0.097 = 0.062. That is, an
increase in the payment-to-income ratio from 0.3 to 0.4 is associated with an increase
in the probability of denial of 6.2 percentage points, from 9.7% to 15.9%.

Because the probit regression function is nonlinear, the effect of a change in X
depends on the starting value of X. For example, if P/I ratio = 0.5, the estimated denial
probability based on Equation (11.7) is $\Phi(-2.19 + 2.97 \times 0.5) = \Phi(-0.71) = 0.239$.
Thus the change in the predicted probability when P/I ratio increases from 0.4 to 0.5
is 0.239 - 0.159, or 8.0 percentage points, larger than the increase of 6.2 percentage
points when P/I ratio increases from 0.3 to 0.4.
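
The three-step procedure of Key Concept 8.1 is easy to script. The sketch below evaluates Equation (11.7) with its rounded coefficients and without rounding the z-value, so the results can differ from the figures in the text in the third decimal place (the text rounds z to two decimals before the table lookup). The function name is an assumption made for this illustration.

```python
from statistics import NormalDist


def probit_denial_probability(pi_ratio: float) -> float:
    """Predicted Pr(deny = 1 | P/I ratio) using the rounded estimates in Equation (11.7)."""
    return NormalDist().cdf(-2.19 + 2.97 * pi_ratio)


# Steps 1 and 2: predicted probabilities at the original and changed values.
p_03 = probit_denial_probability(0.3)
p_04 = probit_denial_probability(0.4)
p_05 = probit_denial_probability(0.5)

# Step 3: differences between predicted probabilities.
change_03_to_04 = p_04 - p_03
change_04_to_05 = p_05 - p_04

print(f"Pr(deny) at 0.3, 0.4, 0.5: {p_03:.3f}, {p_04:.3f}, {p_05:.3f}")
print(f"Change from 0.3 to 0.4: {100 * change_03_to_04:.1f} percentage points")
print(f"Change from 0.4 to 0.5: {100 * change_04_to_05:.1f} percentage points")
```

The second change is larger than the first even though P/I ratio rises by 0.1 in both cases, which is the nonlinearity described above.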

What is the effect of race on the probability of mortgage denial, holding constant 
the payment-to-income ratio? To estimate this effect, we estimate a probit regression 
with both P/I ratio and black as regressors: 


$$\widehat{\Pr}(\text{deny} = 1 | \text{P/I ratio}, \text{black}) = \Phi(\underset{(0.16)}{-2.26} + \underset{(0.44)}{2.74}\,\text{P/I ratio} + \underset{(0.083)}{0.71}\,\text{black}). \tag{11.8}$$


Again, the values of the coefficients are difficult to interpret, but the sign and statisti- 
cal significance are not. The coefficient on black is positive, indicating that an African 
American applicant has a higher probability of denial than a white applicant, holding 
constant their payment-to-income ratio. This coefficient is statistically significant at 
the 1% level (the t-statistic on the coefficient multiplying black is 8.55). For a white 
applicant with P/I ratio = 0.3, the predicted denial probability is 7.5%, while for a
black applicant with P/I ratio = 0.3, it is 23.3%; the difference in denial probabilities 
between these two hypothetical applicants is 15.8 percentage points. 


Estimation of the probit coefficients. The probit coefficients reported here were 
estimated using the method of maximum likelihood, which produces efficient (mini- 
mum variance) estimators in a wide variety of applications, including regression with 
a binary dependent variable. The maximum likelihood estimator is consistent and 
normally distributed in large samples, so t-statistics and confidence intervals for the
coefficients can be constructed in the usual way. 

Regression software for estimating probit models typically uses maximum likeli- 
hood estimation, so this is a simple method to apply in practice. Standard errors 
produced by such software can be used in the same way as the standard errors of 
regression coefficients; for example, a 95% confidence interval for the true probit 
coefficient can be constructed as the estimated coefficient ± 1.96 standard errors.
Similarly, F-statistics computed using maximum likelihood estimators can be used to 
test joint hypotheses. Maximum likelihood estimation is discussed further in 
Section 11.3, with additional details given in Appendix 11.2. 


Logit Regression 

The logit regression model. The logit regression model is similar to the probit 
regression model except that the cumulative standard normal distribution function ® 
in Equation (11.6) is replaced by the cumulative standard logistic distribution function, 
which we denote by F. Logit regression is summarized in Key Concept 11.3. The logistic
cumulative distribution function has a specific functional form, defined in terms of the
exponential function, which is given as the final expression in Equation (11.9).

KEY CONCEPT 11.3 Logit Regression

The population logit model of the binary dependent variable Y with multiple
regressors is

$$\Pr(Y = 1 | X_1, X_2, \ldots, X_k) = F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k)}}. \tag{11.9}$$

Logit regression is similar to probit regression except that the cumulative
distribution function is different.

As with probit, the logit coefficients are best interpreted by computing predicted 
probabilities and differences in predicted probabilities. 

The coefficients of the logit model can be estimated by maximum likelihood. The 
maximum likelihood estimator is consistent and normally distributed in large samples, so 
t-statistics and confidence intervals for the coefficients can be constructed in the usual way. 

The logit and probit regression functions are similar. This is illustrated in 
Figure 11.3, which graphs the probit and logit regression functions for the dependent 
variable deny and the single regressor P/I ratio, estimated by maximum likelihood 
using the same 127 observations as in Figures 11.1 and 11.2. The differences between
the two functions are small.

Historically, the main motivation for logit regression was that the logistic
cumulative distribution function could be computed faster than the normal cumulative
distribution function. With the advent of more powerful computers, this distinction
is no longer important.


Application to the Boston HMDA data. A logit regression of deny against P/I ratio and 
black, using the 2380 observations in the data set, yields the estimated regression function 


$$\widehat{\Pr}(\text{deny} = 1 | \text{P/I ratio}, \text{black}) = F(\underset{(0.35)}{-4.13} + \underset{(0.96)}{5.37}\,\text{P/I ratio} + \underset{(0.15)}{1.27}\,\text{black}). \tag{11.10}$$


The coefficient on black is positive and statistically significant at the 1% level (the
t-statistic is 8.47). The predicted denial probability of a white applicant with
P/I ratio = 0.3 is $1/[1 + e^{-(-4.13 + 5.37 \times 0.3 + 1.27 \times 0)}] = 1/[1 + e^{2.52}] = 0.074$, or
7.4%. The predicted denial probability of an African American applicant with
P/I ratio = 0.3 is $1/[1 + e^{1.25}] = 0.222$, or 22.2%, so the difference between the
two probabilities is 14.8 percentage points.
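
These logit predictions follow directly from the final expression in Equation (11.9). The sketch below plugs in the rounded coefficients from Equation (11.10); the function names `logistic_cdf` and `logit_denial_probability` are illustrative choices, not names from the text.

```python
import math


def logistic_cdf(z: float) -> float:
    """Standard logistic c.d.f., F(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def logit_denial_probability(pi_ratio: float, black: int) -> float:
    """Predicted Pr(deny = 1) using the rounded estimates in Equation (11.10)."""
    return logistic_cdf(-4.13 + 5.37 * pi_ratio + 1.27 * black)


white = logit_denial_probability(0.3, black=0)
african_american = logit_denial_probability(0.3, black=1)
gap = african_american - white

print(f"White applicant:            {white:.3f}")
print(f"African American applicant: {african_american:.3f}")
print(f"Difference: {100 * gap:.1f} percentage points")  # 14.8 percentage points
```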


Comparing the Linear Probability, Probit, 
and Logit Models 


All three models (linear probability, probit, and logit) are just approximations to
the unknown population regression function E(Y|X) = Pr(Y = 1|X). The linear 
probability model is easiest to use and to interpret, but it cannot capture the nonlin- 
ear nature of the true population regression function. Probit and logit regressions 
model this nonlinearity in the probabilities, but their regression coefficients are more 
difficult to interpret. So which should you use in practice? 

There is no one right answer, and different researchers use different models. 
Probit and logit regressions frequently produce similar results. For example, accord- 
ing to the estimated probit model in Equation (11.8), the difference in denial prob- 
abilities between a black applicant and a white applicant with P/I ratio = 0.3 was 
estimated to be 15.8 percentage points, whereas the logit estimate of this gap, based 
on Equation (11.10), was 14.8 percentage points. For practical purposes, the two
estimates are very similar. One way to choose between logit and probit is to pick the
method that is easier to use in your statistical software. 

The linear probability model provides the least sensible approximation to the 
nonlinear population regression function. Even so, in some data sets there may be 
few extreme values of the regressors, in which case the linear probability model still 
can provide an adequate approximation. In the denial probability regression in 
Equation (11.3), the estimated black/white gap from the linear probability model is 
17.7 percentage points, larger than the probit and logit estimates but still qualitatively 
similar. The only way to know this, however, is to estimate both a linear and a non- 
linear model and to compare their predicted probabilities. 
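
As a rough illustration of such a comparison, the sketch below evaluates the three estimated models at P/I ratio = 0.3 using the rounded coefficients in Equations (11.3), (11.8), and (11.10). It compares reported estimates only; it does not re-estimate anything from the HMDA data, and the function and variable names are assumptions made for this example.

```python
import math
from statistics import NormalDist

PI_RATIO = 0.3  # evaluation point used in the text


def linear_probability(pi_ratio, black):
    """Equation (11.3)."""
    return -0.091 + 0.559 * pi_ratio + 0.177 * black


def probit(pi_ratio, black):
    """Equation (11.8)."""
    return NormalDist().cdf(-2.26 + 2.74 * pi_ratio + 0.71 * black)


def logit(pi_ratio, black):
    """Equation (11.10)."""
    z = -4.13 + 5.37 * pi_ratio + 1.27 * black
    return 1.0 / (1.0 + math.exp(-z))


models = {"linear probability": linear_probability, "probit": probit, "logit": logit}

# Black/white difference in predicted denial probability for each model.
gaps = {name: model(PI_RATIO, 1) - model(PI_RATIO, 0) for name, model in models.items()}

for name, gap in gaps.items():
    print(f"{name:>18}: {100 * gap:.1f} percentage points")
```

In the linear probability model the gap is the coefficient on black at every value of P/I ratio, whereas the probit and logit gaps change with the evaluation point.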

11.3 Estimation and Inference in the Logit and Probit Models²


The nonlinear models studied in Sections 8.2 and 8.3 are nonlinear functions of the
independent variables but are linear functions of the unknown coefficients
(parameters). Consequently, the unknown coefficients of those nonlinear regression
functions can be estimated by OLS. In contrast, the probit and logit regression
functions are nonlinear functions of the coefficients. That is, the probit coefficients
$\beta_0, \beta_1, \ldots, \beta_k$ in Equation (11.6) appear inside the cumulative standard
normal distribution function $\Phi$, and the logit coefficients in Equation (11.9) appear
inside the cumulative standard logistic distribution function F. Because the population
regression function is a nonlinear function of the coefficients
$\beta_0, \beta_1, \ldots, \beta_k$, those coefficients cannot be estimated by OLS.

This section provides an introduction to the standard method for estimation of 
probit and logit coefficients, maximum likelihood; additional mathematical details 
are given in Appendix 11.2. Because it is built into modern statistical software, maxi- 
mum likelihood estimation of the probit and logit coefficients is easy in practice. The 
theory of maximum likelihood estimation, however, is more complicated than the 
theory of least squares. We therefore first discuss another estimation method, non- 
linear least squares, before turning to maximum likelihood. 


Nonlinear Least Squares Estimation 


Nonlinear least squares is a general method for estimating the unknown parameters 
of a regression function when, like the probit coefficients, those parameters enter the 
population regression function nonlinearly. The nonlinear least squares estimator, 
which was introduced in Appendix 8.1, extends the OLS estimator to regression func- 
tions that are nonlinear functions of the parameters. Like OLS, nonlinear least 
squares finds the values of the parameters that minimize the sum of squared predic- 
tion mistakes produced by the model. 

To be concrete, consider the nonlinear least squares estimator of the parameters
of the probit model. The conditional expectation of Y given the X’s is
$E(Y | X_1, \ldots, X_k) = \Pr(Y = 1 | X_1, \ldots, X_k) = \Phi(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)$.
Estimation by nonlinear least squares fits this conditional expectation function, which
is a nonlinear function of the parameters, to the dependent variable. That is, the
nonlinear least squares estimator of the probit coefficients is the values of
$b_0, \ldots, b_k$ that minimize the sum of squared prediction mistakes:

$$\sum_{i=1}^{n} \left[ Y_i - \Phi(b_0 + b_1 X_{1i} + \cdots + b_k X_{ki}) \right]^2. \tag{11.11}$$
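
To show what minimizing Expression (11.11) involves, here is a deliberately simple sketch with one regressor. Everything in it is an assumption made for illustration: the data are simulated (not the HMDA data), the "true" coefficients are arbitrary, and a coarse grid search stands in for the numerical optimizer that statistical software would actually use.

```python
import random
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal c.d.f.

# Simulated data from a probit model with hypothetical true coefficients.
random.seed(0)
TRUE_B0, TRUE_B1 = -2.0, 3.0
xs = [random.uniform(0.0, 1.5) for _ in range(400)]
ys = [1 if random.random() < phi(TRUE_B0 + TRUE_B1 * x) else 0 for x in xs]


def sum_squared_mistakes(b0: float, b1: float) -> float:
    """Expression (11.11) with a single regressor."""
    return sum((y - phi(b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Coarse grid of candidate coefficient values (step 0.1).
grid_b0 = [i / 10 for i in range(-40, 1)]  # -4.0, -3.9, ..., 0.0
grid_b1 = [i / 10 for i in range(0, 61)]   #  0.0,  0.1, ..., 6.0

# Nonlinear least squares: pick the pair with the smallest sum of squares.
best_b0, best_b1 = min(
    ((b0, b1) for b0 in grid_b0 for b1 in grid_b1),
    key=lambda pair: sum_squared_mistakes(*pair),
)

print(f"NLS estimates on the grid: b0 = {best_b0:.1f}, b1 = {best_b1:.1f}")
print(f"Minimized sum of squared mistakes: {sum_squared_mistakes(best_b0, best_b1):.2f}")
```

The estimates will generally be near, but not equal to, the values used to simulate the data, because they depend on the particular random sample.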
The nonlinear least squares estimator shares two key properties with the OLS
estimator in linear regression: It is consistent (the probability that it is close to
the true value approaches 1 as the sample size gets large), and it is normally
distributed in large samples. There are, however, estimators that have a smaller
variance than the nonlinear least squares estimator; that is, the nonlinear least
squares estimator is inefficient. For this reason, the nonlinear least squares estimator
of the probit coefficients is rarely used in practice, and instead the parameters are
estimated by maximum likelihood.

²This section contains more advanced material that can be skipped without loss of continuity.


Maximum Likelihood Estimation 


The likelihood function is the joint probability distribution of the data, treated as a 
function of the unknown coefficients. The maximum likelihood estimator (MLE) of 
the unknown coefficients consists of the values of the coefficients that maximize the 
likelihood function. Because the MLE chooses the unknown coefficients to maxi- 
mize the likelihood function, which is in turn the joint probability distribution, in 
effect the MLE chooses the values of the parameters to maximize the probability of 
drawing the data that are actually observed. In this sense, the MLEs are the param- 
eter values “most likely” to have produced the data. 

To illustrate maximum likelihood estimation, consider two i.i.d. observations, $Y_1$
and $Y_2$, on a binary dependent variable with no regressors. Thus Y is a Bernoulli
random variable, and the only unknown parameter to estimate is the probability p
that Y = 1, which is also the mean of Y.

To obtain the maximum likelihood estimator, we need an expression for the
likelihood function, which in turn requires an expression for the joint probability
distribution of the data. The joint probability distribution of the two observations
$Y_1$ and $Y_2$ is $\Pr(Y_1 = y_1, Y_2 = y_2)$. Because $Y_1$ and $Y_2$ are independently
distributed, the joint distribution is the product of the individual distributions
[Equation (2.24)], so $\Pr(Y_1 = y_1, Y_2 = y_2) = \Pr(Y_1 = y_1)\Pr(Y_2 = y_2)$. The
Bernoulli distribution can be summarized in the formula
$\Pr(Y = y) = p^{y}(1 - p)^{1 - y}$: When y = 1, $\Pr(Y = 1) = p^{1}(1 - p)^{0} = p$, and when
y = 0, $\Pr(Y = 0) = p^{0}(1 - p)^{1} = 1 - p$. Thus the joint probability distribution of
$Y_1$ and $Y_2$ is
$\Pr(Y_1 = y_1, Y_2 = y_2) = [p^{y_1}(1 - p)^{1 - y_1}] \times [p^{y_2}(1 - p)^{1 - y_2}] = p^{(y_1 + y_2)}(1 - p)^{2 - (y_1 + y_2)}$.

The likelihood function is the joint probability distribution, treated as a function
of the unknown coefficients. For n = 2 i.i.d. observations on Bernoulli random
variables, the likelihood function is

$$f(p; Y_1, Y_2) = p^{(Y_1 + Y_2)}(1 - p)^{2 - (Y_1 + Y_2)}. \tag{11.12}$$


The maximum likelihood estimator of p is the value of p that maximizes the
likelihood function in Equation (11.12). As with all maximization or minimization
problems, this can be done by trial and error; that is, you can try different values
of p and compute the likelihood $f(p; Y_1, Y_2)$ until you are satisfied that you have
maximized this function. In this example, however, maximizing the likelihood function
using calculus produces a simple formula for the MLE: The MLE is
$\hat{p} = \frac{1}{2}(Y_1 + Y_2)$. In other words, the MLE of p is just the sample average!
In fact, for general n, the MLE $\hat{p}$ of the Bernoulli probability p is the sample
average; that is, $\hat{p} = \bar{Y}$ (this is shown in Appendix 11.2). In this example, the
MLE is the usual estimator of p, the fraction of times $Y_i = 1$ in the sample.
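
The trial-and-error approach can be sketched as a grid search. The code below generalizes Equation (11.12) to n observations and tries 1,001 equally spaced values of p; the two small samples are hypothetical, and the function names are illustrative.

```python
def bernoulli_likelihood(p: float, ys: list) -> float:
    """Likelihood of i.i.d. Bernoulli data: p^(sum of Y) * (1 - p)^(n - sum of Y)."""
    successes = sum(ys)
    return p ** successes * (1.0 - p) ** (len(ys) - successes)


def grid_search_mle(ys: list, grid_points: int = 1000) -> float:
    """Try p = 0, 1/grid_points, ..., 1 and return the value with the largest likelihood."""
    candidates = [i / grid_points for i in range(grid_points + 1)]
    return max(candidates, key=lambda p: bernoulli_likelihood(p, ys))


# Two observations, as in Equation (11.12): Y1 = 1 and Y2 = 0.
two_obs = [1, 0]
print(grid_search_mle(two_obs), sum(two_obs) / len(two_obs))    # 0.5 0.5

# A hypothetical sample with n = 4.
four_obs = [1, 1, 0, 1]
print(grid_search_mle(four_obs), sum(four_obs) / len(four_obs))  # 0.75 0.75
```

In both samples the grid search lands on the sample average, in line with the calculus result.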

This example is similar to the problem of estimating the unknown coefficients of 
the probit and logit regression models. In those models, the success probability p is 
not constant but rather depends on X; that is, it is the success probability conditional 
on X, which is given in Equation (11.6) for the probit model and Equation (11.9) for 
the logit model. Thus the probit and logit likelihood functions are similar to the likeli- 
hood function in Equation (11.12) except that the success probability varies from one 
observation to the next (because it depends on $X_i$). Expressions for the probit and
logit likelihood functions are given in Appendix 11.2. 

Like the nonlinear least squares estimator, the MLE is consistent and normally 
distributed in large samples. Because regression software commonly computes the 
MLE of the probit coefficients, this estimator is easy to use in practice. All the esti- 
mated probit and logit coefficients reported in this chapter are MLEs. 


Statistical inference based on the MLE. Because the MLE is normally distributed in 
large samples, statistical inference about the probit and logit coefficients based on 
the MLE proceeds in the same way as inference about the linear regression function 
coefficients based on the OLS estimator. That is, hypothesis tests are performed using 
the t-statistic, and 95% confidence intervals are formed as ±1.96 standard errors.
Tests of joint hypotheses on multiple coefficients use the F-statistic in a way similar 
to that discussed in Chapter 7 for the linear regression model. All of this is com- 
pletely analogous to statistical inference in the linear regression model. 

An important practical point is that some statistical software reports tests of joint 
hypotheses using the F-statistic, while other software uses the chi-squared statistic. The 
chi-squared statistic is $q \times F$, where $q$ is the number of restrictions being tested. Because
the F-statistic is, under the null hypothesis, distributed as $\chi^2_q / q$ in large samples, $q \times F$
is distributed as $\chi^2_q$ in large samples. Because the two approaches differ only in whether
they divide by q, they produce identical inferences, but you need to know which 
approach is implemented in your software so that you use the correct critical values. 
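
The equivalence of the two conventions can be illustrated with a short sketch. It is restricted to q = 2 restrictions, because the chi-squared distribution with 2 degrees of freedom has the simple tail probability exp(-w/2), which avoids any external library. The reported F-statistic of 4.10 is a made-up value.

```python
import math

def chi2_sf_2df(w):
    """Pr(chi-squared with 2 degrees of freedom > w); closed form exp(-w/2)."""
    return math.exp(-w / 2)

q = 2          # number of restrictions (the closed form above needs q = 2)
alpha = 0.05

# 5% critical values: chi-squared_2 and the large-sample F = chi-squared_2 / 2.
chi2_critical = -2 * math.log(alpha)
f_critical = chi2_critical / q

# Suppose software reports F = 4.10 (a made-up value) for a joint test.
f_stat = 4.10
chi2_stat = q * f_stat

reject_using_f = f_stat > f_critical
reject_using_chi2 = chi2_stat > chi2_critical
p_value = chi2_sf_2df(chi2_stat)  # same p-value under either convention

print(round(chi2_critical, 2), round(f_critical, 2), reject_using_f, reject_using_chi2)
```

Comparing the F-statistic of 4.10 with the chi-squared critical value instead of the F critical value would wrongly fail to reject, which is why it matters to know which statistic the software reports.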


Measures of Fit 


In Section 11.1, it was mentioned that the $R^2$ is a poor measure of fit for the linear
probability model. This is also true for probit and logit regression. Two measures of
fit for models with binary dependent variables are the fraction correctly predicted
and the pseudo-$R^2$. The fraction correctly predicted uses the following rule: If $Y_i = 1$
and the predicted probability exceeds 50% or if $Y_i = 0$ and the predicted probability
is less than 50%, then $Y_i$ is said to be correctly predicted. Otherwise, $Y_i$ is said to be
incorrectly predicted. The fraction correctly predicted is the fraction of the n observations
$Y_1, \ldots, Y_n$ that are correctly predicted.

An advantage of this measure of fit is that it is easy to understand. A
disadvantage is that it does not reflect the quality of the prediction: If $Y_i = 1$, the
observation is treated as correctly predicted whether the predicted probability is
51% or 90%.
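
The rule is simple enough to write down directly. The sketch below applies it to two hypothetical sets of predicted probabilities, one barely on the right side of 50% and one far from it, to show that the measure cannot tell them apart. All numbers are invented for illustration.

```python
def fraction_correctly_predicted(y, predicted_prob):
    """Share of observations classified correctly by the 50% rule."""
    correct = 0
    for yi, pi in zip(y, predicted_prob):
        if (yi == 1 and pi > 0.5) or (yi == 0 and pi < 0.5):
            correct += 1
    return correct / len(y)

# Hypothetical outcomes and predicted probabilities (illustration only).
y = [1, 1, 0, 0, 1, 0]
p_close = [0.51, 0.55, 0.49, 0.45, 0.40, 0.60]  # barely right when right
p_sharp = [0.90, 0.95, 0.10, 0.05, 0.40, 0.60]  # confidently right when right

print(fraction_correctly_predicted(y, p_close))
print(fraction_correctly_predicted(y, p_sharp))
```

Both calls return the same fraction (four of six observations), even though the second set of predictions is far more informative.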

The pseudo-$R^2$ measures the fit of the model using the likelihood function.
Because the MLE maximizes the likelihood function, adding another regressor to a
probit or logit model increases the value of the maximized likelihood, just like adding
a regressor necessarily reduces the sum of squared residuals in linear regression by
OLS. This suggests measuring the quality of fit of a probit model by comparing values
of the maximized likelihood function with all the regressors to the value of the likelihood
with none. This is, in fact, what the pseudo-$R^2$ does. A formula for the pseudo-$R^2$
is given in Appendix 11.2.
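
One common way to implement this comparison, often attributed to McFadden, is 1 minus the ratio of the maximized log likelihood with regressors to the log likelihood with none. The sketch below assumes that definition; the formula the text itself uses is the one in Appendix 11.2, which is not reproduced here. The outcomes and fitted probabilities are hypothetical stand-ins for the output of an estimated model.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log likelihood with observation-specific probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

def pseudo_r2(y, fitted_prob):
    """1 - ln(L_model) / ln(L_null), where the null model has no regressors."""
    p_bar = sum(y) / len(y)  # Bernoulli MLE: the sample average
    ll_null = log_likelihood(y, [p_bar] * len(y))
    ll_model = log_likelihood(y, fitted_prob)
    return 1 - ll_model / ll_null

# Hypothetical outcomes and fitted probabilities (illustration only).
y = [0, 0, 1, 0, 1, 1]
fitted = [0.16, 0.27, 0.42, 0.58, 0.73, 0.84]

print(round(pseudo_r2(y, fitted), 3))
```

If the fitted probabilities are no better than the sample average, the measure is 0; it approaches 1 as the fitted probabilities approach the observed outcomes.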


Application to the Boston HMDA Data 


The regressions of the previous two sections indicated that denial rates were higher 
for black than white applicants, holding constant their payment-to-income ratio. 
Loan officers, however, legitimately weigh many factors when deciding on a mort- 
gage application, and if any of those other factors differ systematically by race, the 
estimators considered so far have omitted variable bias. 

In this section, we take a closer look at whether there is statistical evidence of 
discrimination in the Boston HMDA data. Specifically, our objective is to estimate 
the effect of race on the probability of denial, holding constant those applicant char- 
acteristics that a loan officer might legally consider when deciding on a mortgage 
application. 

The most important variables available to loan officers through the mortgage 
applications in the Boston HMDA data set are listed in Table 11.1; these are the 
variables we will focus on in our empirical models of loan decisions. The first two 
variables are direct measures of the financial burden the proposed loan would 
place on the applicant, measured in terms of his or her income. The first of these is 
the P/I ratio; the second is the ratio of housing-related expenses to income. The 
next variable is the size of the loan, relative to the assessed value of the home; if 
the loan-to-value ratio is nearly 1, the bank might have trouble recouping the full 
amount of the loan if the applicant defaults on the loan and the bank forecloses. 
The final three financial variables summarize the applicant’s credit history. If an 
applicant has been unreliable paying off debts in the past, the loan officer legiti- 
mately might worry about the applicant’s ability or desire to make mortgage pay- 
ments in the future. The three variables measure different types of credit histories, 
which the loan officer might weigh differently. The first concerns consumer credit, 
such as credit card debt; the second is previous mortgage payment history; and the 
third measures credit problems so severe that they appeared in a public legal 
record, such as filing for bankruptcy. 
Table 11.1

Variable                          Definition                                                        Sample Average

Financial Variables
P/I ratio                         Ratio of total monthly debt payments to total monthly income      0.331
housing expense-to-income ratio   Ratio of monthly housing expenses to total monthly income         0.255
loan-to-value ratio               Ratio of size of loan to assessed value of property               0.738
consumer credit score             1 if no “slow” payments or delinquencies                          2.1
                                  2 if one or two slow payments or delinquencies
                                  3 if more than two slow payments
                                  4 if insufficient credit history for determination
                                  5 if delinquent credit history with payments 60 days overdue
                                  6 if delinquent credit history with payments 90 days overdue
mortgage credit score             1 if no late mortgage payments                                    1.7
                                  2 if no mortgage payment history
                                  3 if one or two late mortgage payments
                                  4 if more than two late mortgage payments
public bad credit record          1 if any public record of credit problems (bankruptcy,            0.074
                                  charge-offs, collection actions)
                                  0 otherwise

Additional Applicant Characteristics
denied mortgage insurance         1 if applicant applied for mortgage insurance and was denied,     0.020
                                  0 otherwise
self-employed                     1 if self-employed, 0 otherwise                                   0.116
single                            1 if applicant reported being single, 0 otherwise                 0.393
high school diploma               1 if applicant graduated from high school, 0 otherwise            0.984
unemployment rate                 1989 Massachusetts unemployment rate in the applicant’s industry  3.8
condominium                       1 if unit is a condominium, 0 otherwise                           0.288
black                             1 if applicant is black, 0 if white                               0.142
deny                              1 if mortgage application denied, 0 otherwise                     0.120


Table 11.1 also lists some other variables relevant to the loan officer’s decision.
Sometimes the applicant must apply for private mortgage insurance.³ The loan officer
knows whether that application was denied, and that denial would weigh negatively
with the loan officer. The next four variables, which concern the applicant’s
employment status, marital status, and educational attainment, as well as the
unemployment rate in the applicant’s industry, relate to the prospective ability of the
applicant to repay. In the event of foreclosure, characteristics of the property are relevant
as well, and the next variable indicates whether the property is a condominium. The
final two variables in Table 11.1 are whether the applicant is black or white and
whether the application was denied or accepted. In these data, 14.2% of applicants
are black, and 12.0% of applications are denied.

³Mortgage insurance is an insurance policy under which the insurance company makes the monthly payment
to the bank if the borrower defaults. During the period of this study, if the loan-to-value ratio exceeds
80%, the applicant typically was required to buy mortgage insurance.

Table 11.2 presents regression results based on these variables. The base specifi- 
cations, reported in columns (1) through (3), include the financial variables in 
Table 11.1 plus the variables indicating whether private mortgage insurance was 
denied and whether the applicant is self-employed. In the 1990s, loan officers com- 
monly used thresholds, or cutoff values, for the loan-to-value ratio, so the base speci- 
fication for that variable uses binary variables for whether the loan-to-value ratio is 
high (≥ 0.95), medium (between 0.8 and 0.95), or low (< 0.8; this case is omitted to
avoid perfect multicollinearity). The regressors in the first three columns are similar 
to those in the base specification considered by the Federal Reserve Bank of Boston 
researchers in their original analysis of these data.⁴ The regressions in columns (1)
through (3) differ only in how the denial probability is modeled, using a linear prob- 
ability model, a logit model, and a probit model, respectively. 
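
The threshold coding of the loan-to-value ratio described above can be sketched as follows. The cutoffs follow the row labels of Table 11.2 (medium: 0.80 to 0.95 inclusive; high: above 0.95), so the treatment of a ratio of exactly 0.95 is an assumption of this sketch, and the function name is hypothetical.

```python
def loan_to_value_indicators(ltv):
    """Return (medium, high) indicators; low (< 0.80) is the omitted category."""
    high = 1 if ltv > 0.95 else 0
    medium = 1 if 0.80 <= ltv <= 0.95 else 0
    return medium, high

# Illustrative loan-to-value ratios (made-up values).
for ltv in [0.60, 0.80, 0.90, 0.97]:
    print(ltv, loan_to_value_indicators(ltv))
```

Only two indicators are created for three categories: including a third indicator for the low category alongside the constant would produce perfect multicollinearity.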

Because the coefficients of the logit and probit models in columns (2)-(6) are not 
directly interpretable, the table reports standard errors but not confidence intervals. 
In addition, because the aim of these regressions is to approximate the loan officers’ 
decision rule, it is of interest to know whether individual variables — especially the 
applicant’s race—enter that decision rule. Thus the table reports, through asterisks, 
whether the test that the coefficient is 0 rejects at the 5% or 1% significance level. 

Because the regression in column (1) is a linear probability model, its coefficients 
are estimated changes in predicted probabilities arising from a unit change in the inde- 
pendent variable. Accordingly, an increase in P/I ratio of 0.1 is estimated to increase 
the probability of denial by 4.5 percentage points (the coefficient on P/I ratio in column
(1) is 0.449, and 0.449 × 0.1 = 0.045). Similarly, having a high loan-to-value ratio
increases the probability of denial: A loan-to-value ratio exceeding 95% is associated 
with an 18.9 percentage point increase (the coefficient is 0.189) in the denial probabil- 
ity, relative to the omitted case of a loan-to-value ratio less than 80%, holding the other 
variables in column (1) constant. Applicants with a poor credit rating also have a more 
difficult time getting a loan, all else being constant, although interestingly the coeffi- 
cient on consumer credit is statistically significant but the coefficient on mortgage 
credit is not. Applicants with a public record of credit problems, such as filing for bank- 
ruptcy, have much greater difficulty obtaining a loan: All else equal, a public bad credit 
record is estimated to increase the probability of denial by 0.197, or 19.7 percentage 
points. Being denied private mortgage insurance is estimated to be virtually decisive: 
The estimated coefficient of 0.702 means that being denied mortgage insurance 
increases your chance of being denied a mortgage by 70.2 percentage points, all else
equal. Of the nine variables (other than race) in the regression, the coefficients on all
but two are statistically significant at the 5% level, which is consistent with loan
officers’ considering many factors when they make their decisions.

⁴The difference between the regressors in columns (1) through (3) and those in Munnell et al. (1996),
table 2 (1), is that Munnell et al. include additional indicators for the location of the home and the identity
of the lender, data that are not publicly available; an indicator for a multifamily home, which is irrelevant
here because our subset focuses on single-family homes; and net wealth, which we omit because this
variable has a few very large positive and negative values and thus risks making the results sensitive to a few
specific outlier observations.

Table 11.2 Mortgage Denial Regressions Using the Boston HMDA Data
Dependent variable: deny = 1 if mortgage application is denied, = 0 if accepted; 2380 observations.

Regression Model                          LPM        Logit      Probit     Probit     Probit     Probit
Regressor                                 (1)        (2)        (3)        (4)        (5)        (6)
black                                     0.084**    0.688**    0.389**    0.371**    0.363**    0.246
                                          (0.023)    (0.182)    (0.098)    (0.099)    (0.100)    (0.448)
P/I ratio                                 0.449**    4.76**     2.44**     2.46**     2.62**     2.01**
                                          (0.114)    (1.33)     (0.61)     (0.60)     (0.61)     (0.66)
housing expense-to-income ratio           -0.048     -0.11      -0.18      -0.30      -0.50      -0.54
                                          (0.110)    (1.29)     (0.68)     (0.68)     (0.70)     (0.74)
medium loan-to-value ratio                0.031*     0.46**     0.21**     0.22**     0.22**     0.22**
(0.80 ≤ loan-value ratio ≤ 0.95)          (0.013)    (0.16)     (0.08)     (0.08)     (0.08)     (0.08)
high loan-to-value ratio                  0.189**    1.49**     0.79**     0.79**     0.84**     0.79**
(loan-value ratio > 0.95)                 (0.050)    (0.32)     (0.18)     (0.18)     (0.18)     (0.18)
consumer credit score                     0.031**    0.29**     0.15**     0.16**     0.34**     0.16**
                                          (0.005)    (0.04)     (0.02)     (0.02)     (0.11)     (0.02)
mortgage credit score                     0.021      0.28*      0.15*      0.11       0.16       0.11
                                          (0.011)    (0.14)     (0.07)     (0.08)     (0.10)     (0.08)
public bad credit record                  0.197**    1.23**     0.70**     0.70**     0.72**     0.70**
                                          (0.035)    (0.20)     (0.12)     (0.12)     (0.12)     (0.12)
denied mortgage insurance                 0.702**    4.55**     2.56**     2.59**     2.59**     2.59**
                                          (0.045)    (0.57)     (0.30)     (0.29)     (0.30)     (0.29)
self-employed                             0.060**    0.67**     0.36**     0.35**     0.34**     0.35**
                                          (0.021)    (0.21)     (0.11)     (0.11)     (0.11)     (0.11)
single                                                                     0.23**     0.23**     0.23**
                                                                           (0.08)     (0.08)     (0.08)
high school diploma                                                        -0.61**    -0.60*     -0.62**
                                                                           (0.23)     (0.24)     (0.23)
unemployment rate                                                          0.03       0.03       0.03
                                                                           (0.02)     (0.02)     (0.02)
condominium                                                                           -0.05
                                                                                      (0.09)
black × P/I ratio                                                                                -0.58
                                                                                                 (1.47)
black × housing expense-to-income ratio                                                          1.23
                                                                                                 (1.69)
additional credit rating indicator        no         no         no         no         yes        no
variables
constant                                  -0.183**   -5.71**    -3.04**    -2.57**    -2.90**    -2.54**
                                          (0.028)    (0.48)     (0.23)     (0.34)     (0.39)     (0.35)

F-Statistics and p-Values Testing Exclusion of Groups of Variables
                                          (1)        (2)        (3)        (4)        (5)        (6)
applicant single; high school diploma;                                     5.85       5.22       5.79
industry unemployment rate                                                 (< 0.001)  (0.001)    (< 0.001)
additional credit rating indicator                                                    1.22
variables                                                                             (0.291)
race interactions and black                                                                      4.96
                                                                                                 (0.002)
race interactions only                                                                           0.27
                                                                                                 (0.766)
difference in predicted probability       8.4%       6.0%       7.1%       6.6%       6.3%       6.5%
of denial, white vs. black (percentage
points)

These regressions were estimated using the n = 2380 observations in the Boston HMDA data set described in Appendix 11.1.
The linear probability model was estimated by OLS, and probit and logit regressions were estimated by maximum likelihood.
Standard errors are given in parentheses under the coefficients, and p-values are given in parentheses under the F-statistics.
The change in predicted probability in the final row was computed for a hypothetical applicant whose values of the regressors,
other than race, equal the sample mean. Individual coefficients are statistically significant at the *5% or **1% level.

The coefficient on black in regression (1) is 0.084, indicating that the difference 
in denial probabilities for black and white applicants is 8.4 percentage points, holding 
constant the other variables in the regression. This is statistically significant at the 1% 
significance level (t = 3.65). 
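
The arithmetic behind these statements is short enough to check directly. The sketch below uses the column (1) estimates from Table 11.2 to reproduce the t-statistic on black, an implied 95% confidence interval (which the table itself does not report), and the effect of a 0.1 increase in the P/I ratio.

```python
coefficient = 0.084      # coefficient on black, column (1) of Table 11.2
standard_error = 0.023   # its standard error, from the same column

t_statistic = coefficient / standard_error
ci_lower = coefficient - 1.96 * standard_error
ci_upper = coefficient + 1.96 * standard_error

# Effect of a 0.1 increase in the P/I ratio in the linear probability model.
pi_effect = 0.449 * 0.1

print(round(t_statistic, 2), round(ci_lower, 3), round(ci_upper, 3), round(pi_effect, 3))
```

Because the model in column (1) is linear, these effects do not depend on the values of the other regressors; that is not true of the logit and probit columns.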

The logit and probit estimates reported in columns (2) and (3) yield similar conclu- 
sions. In the logit and probit regressions, eight of the nine coefficients on variables other 
than race are individually statistically significantly different from 0 at the 5% level, and 
the coefficient on black is statistically significant at the 1% level. As discussed in 
Section 11.2, because these models are nonlinear, specific values of all the regressors 
must be chosen to compute the difference in predicted probabilities for white applicants 
and black applicants. A conventional way to make this choice is to consider an “average” 
applicant who has the sample average values of all the regressors other than race. The 
final row in Table 11.2 reports this estimated difference in probabilities, evaluated for 
this average applicant. The estimated racial differentials are similar to each other: 
8.4 percentage points for the linear probability model [column (1)], 6.0 percentage 
points for the logit model [column (2)], and 7.1 percentage points for the probit model
[column (3)]. These estimated race effects and the coefficients on black are less than in 
the regressions of the previous sections, in which the only regressors were P/I ratio and 
black, indicating that those earlier estimates had omitted variable bias. 

The regressions in columns (4) through (6) investigate the sensitivity of the 
results in column (3) to changes in the regression specification. Column (4) modifies
column (3) by including additional applicant characteristics. These characteristics
help to predict whether the loan is denied; for example, having at least a high school 
diploma reduces the probability of denial (the estimate is negative, and the coeffi- 
cient is statistically significant at the 1% level). However, controlling for these per- 
sonal characteristics does not change the estimated coefficient on black or the 
estimated difference in denial probabilities (6.6%) in an important way. 

Column (5) breaks out the six consumer credit categories and four mortgage 
credit categories to test the null hypothesis that these two variables enter linearly; 
this regression also adds a variable indicating whether the property is a condomin- 
ium. The null hypothesis that the credit rating variables enter the expression for the 
z-value linearly is not rejected, nor is the condominium indicator significant, at the 
5% level. Most importantly, the estimated racial difference in denial probabilities 
(6.3%) is essentially the same as in columns (3) and (4). 

Column (6) examines whether there are interactions. Are different standards 
applied to evaluating the payment-to-income and housing expense-to-income ratios 
for black applicants versus white applicants? The answer appears to be no: The interac- 
tion terms are not jointly statistically significant at the 5% level. However, race contin- 
ues to have a significant effect, because the race indicator and the interaction terms are 
jointly statistically significant at the 1% level. Again, the estimated racial difference in 
denial probabilities (6.5%) is essentially the same as in the other probit regressions. 

In all six specifications, the effect of race on the denial probability, holding other 
applicant characteristics constant, is statistically significant at the 1% level. The esti- 
mated difference in denial probabilities between black applicants and white appli- 
cants ranges from 6.0 percentage points to 8.4 percentage points. 

One way to assess whether this differential is large or small is to return to a variation 
on the question posed at the beginning of this chapter. Suppose two individuals apply for 
a mortgage, one white and one black, but otherwise having the same values of the other 
independent variables in regression (3); specifically, aside from race, the values of the other 
variables in regression (3) are the sample average values in the HMDA data set. The white 
applicant faces a 7.4% chance of denial, but the black applicant faces a 14.5% chance of
denial. The estimated racial difference in denial probabilities, 7.1 percentage points, means
that the black applicant is nearly twice as likely to be denied as the white applicant. 
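
These numbers can be approximately reproduced from the probit coefficient on black alone. The sketch below takes the white applicant's denial probability from the text, backs out the implied z-value with the inverse standard normal c.d.f., adds the column (3) coefficient on black, and converts back to a probability. Backing out the z-value this way is a shortcut used here because the full set of sample means is not reproduced in the text; the inputs are rounded, so the results match only approximately.

```python
from statistics import NormalDist

std_normal = NormalDist()

p_white = 0.074      # denial probability for the average white applicant (from the text)
coef_black = 0.389   # probit coefficient on black, column (3) of Table 11.2

# Back out the z-value implied by p_white, then shift it by the coefficient on black.
z_white = std_normal.inv_cdf(p_white)
z_black = z_white + coef_black
p_black = std_normal.cdf(z_black)

difference = p_black - p_white
ratio = p_black / p_white

print(round(p_black, 3), round(difference, 3), round(ratio, 2))
```

The same coefficient would imply a different change in probability for an applicant whose other characteristics put them at a different starting z-value, which is why the text fixes the other regressors at their sample averages.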

The results in Table 11.2 (and in the original Boston Fed study) provide statistical 
evidence of racial patterns in mortgage denial that, by law, ought not be there. This 
evidence played an important role in spurring policy changes by bank regulators.⁵
But economists love a good argument, and not surprisingly these results have also
stimulated a vigorous debate.

⁵These policy shifts include changes in the way that fair lending examinations were done by federal bank
regulators, changes in inquiries made by the U.S. Department of Justice, and enhanced education programs
for banks and other home loan origination companies.

Because the suggestion that there is (or was) racial discrimination in lending is
charged, we briefly review some points of this debate. In so doing, it is useful to adopt
the framework of Chapter 9, that is, to consider the internal and external validity of
the results in Table 11.2, which are representative of previous analyses of the
Boston HMDA data. A number of the criticisms made of the original Federal Reserve 
Bank of Boston study concern internal validity: possible errors in the data, alternative 
nonlinear functional forms, additional interactions, and so forth. The original data 
were subjected to a careful audit, some errors were found, and the results reported 
here (and in the final published Boston Fed study) are based on the “cleaned” data 
set. Estimation of other specifications —different functional forms and/or additional 
regressors—also produces estimates of racial differentials comparable to those in 
Table 11.2. A potentially more difficult issue of internal validity is whether there is 
relevant nonracial financial information obtained during in-person loan interviews, 
but not recorded on the loan application itself, that is correlated with race; if so, there 
still might be omitted variable bias in the Table 11.2 regressions. Finally, some have 
questioned external validity: Even if there was racial discrimination in Boston in 1990, 
it is wrong to implicate lenders elsewhere today. Moreover, racial discrimination might 
be less likely using modern online applications because the mortgage can be approved 
or denied without a face-to-face meeting. The only way to resolve the question of
external validity is to consider data from other locations and years.⁶


Conclusion 


When the dependent variable Y is binary, the population regression function is the 
probability that Y = 1, conditional on the regressors. Estimation of this population 
regression function entails finding a functional form that does justice to its probabil- 
ity interpretation, estimating the unknown parameters of that function, and inter- 
preting the results. The resulting predicted values are predicted probabilities, and the 
estimated effect of a change in a regressor X is the estimated change in the probabil- 
ity that Y = 1 arising from the change in X. 

A natural way to model the probability that Y = 1 given the regressors is to use a 
cumulative distribution function, where the argument of the c.d.f. depends on the regres- 
sors. Probit regression uses a normal c.d.f. as the regression function, and logit regression 
uses a logistic c.d.f. Because these models are nonlinear functions of the unknown 
parameters, those parameters are more complicated to estimate than linear regression 
coefficients. The standard estimation method is maximum likelihood. In practice, statis- 
tical inference using the maximum likelihood estimates proceeds the same way as it 
does in linear multiple regression; for example, 95% confidence intervals for a coefficient
are constructed as the estimated coefficient ± 1.96 standard errors.


⁶If you are interested in further reading on this topic, a good place to start is the symposium on racial
discrimination and economics in the Spring 1998 issue of the Journal of Economic Perspectives. The article 
in that symposium by Helen Ladd (1998) surveys the evidence and debate on racial discrimination in 
mortgage lending. A more detailed treatment is given in Goering and Wienk (1996). The U.S. mortgage 
market has changed dramatically since the Boston Fed study, including a relaxation of lending standards, 
a bubble in housing prices, the financial crisis of 2008-2009, and a return to tighter lending standards. For 
an introduction to changes in mortgage markets, see Green and Wachter (2008). 
James Heckman and Daniel McFadden, Nobel Laureates


The 2000 Nobel Prize in Economics was
awarded jointly to two econometricians, James 
J. Heckman of the University of Chicago and Dan- 
iel L. McFadden of the University of California at 
Berkeley, for fundamental contributions to the anal- 
ysis of data on individuals and firms. Much of their 
work addressed difficulties that arise with limited 
dependent variables. 

Heckman was awarded the prize for develop- 
ing tools for handling sample selection. As discussed 
in Section 9.2, sample selection bias occurs when the 
availability of data is influenced by a selection process 
related to the value of the dependent variable. For 
example, suppose you want to estimate the relationship 
between earnings and some regressor, X, using a ran- 
dom sample from the population. If you estimate the 
regression using the subsample of employed workers — 
that is, those reporting positive earnings—the OLS 
estimate could be subject to selection bias. Heckman’s 
solution was to specify a preliminary equation with 
a binary dependent variable indicating whether the 
worker is in or out of the labor force (in or out of the 
subsample) and to treat this equation and the earn- 
ings equation as a system of simultaneous equations. 
This general strategy has been extended to selection 
problems that arise in many fields, ranging from labor
economics to industrial organization to finance.


McFadden was awarded the prize for develop- 
ing models for analyzing discrete choice data (does 
a high school graduate join the military, go to col- 
lege, or get a job?). He started by considering the 
problem of an individual maximizing the expected 
utility of each possible choice, which could depend 
on observable variables (such as wages, job charac- 
teristics, and family background). He then derived 
models for the individual choice probabilities with 
unknown coefficients, which in turn could be esti- 
mated by maximum likelihood. These models and 
their extensions have proven widely useful in ana- 
lyzing discrete choice data in many fields, including 
labor economics, health economics, and transporta- 
tion economics. 

For more information on these and other Nobel 
laureates in economics, visit the Nobel Foundation
website, http://www.nobel.se/economics.


Despite its intrinsic nonlinearity, sometimes the population regression function 
can be adequately approximated by a linear probability model—that is, by the 
straight line produced by linear multiple regression. The linear probability model, 
probit regression, and logit regression all give similar bottom-line answers when they 
are applied to the Boston HMDA data: All three methods estimate substantial dif- 
ferences in mortgage denial rates for otherwise similar black applicants and white 
applicants. 

Binary dependent variables are the most common example of limited dependent 
variables, which are dependent variables with a limited range. The final quarter of the 
20th century saw important advances in econometric methods for analyzing other 
limited dependent variables (see the box “James Heckman and Daniel McFadden, 
Nobel Laureates”). Some of these methods are reviewed in Appendix 11.3. 
Summary


1. When Y is a binary variable, the population regression function shows the
probability that Y = 1 given the value of the regressors, $X_1, X_2, \ldots, X_k$.

2. The linear multiple regression model is called the linear probability model 
when Y is a binary variable because the probability that Y = 1 is a linear func- 
tion of the regressors. 

3. Probit and logit regression models are nonlinear regression models used when 
Y is a binary variable. Unlike the linear probability model, probit and logit 
regressions ensure that the predicted probability that Y = 1 is between 0 and 
1 for all values of X. 

4. Probit regression uses the standard normal cumulative distribution function. 
Logit regression uses the logistic cumulative distribution function. Logit and 
probit coefficients are estimated by maximum likelihood. 

5. The values of coefficients in probit and logit regressions are not easy to inter- 
pret. Changes in the probability that Y = 1 associated with changes in one or 
more of the X’s can be calculated using the general procedure for nonlinear 
models outlined in Key Concept 8.1. 

6. Hypothesis tests on coefficients in the linear probability, logit, and probit mod- 
els are performed using the usual t- and F-statistics. 

Key Terms

limited dependent variable
linear probability model
probit
logit
logistic regression
likelihood function
maximum likelihood estimator (MLE)
fraction correctly predicted
pseudo-$R^2$
Review the Concepts


11.1 Suppose a linear probability model yields a predicted value of Y that is equal
to 1.3. Explain why this is nonsensical.


11.2 In Table 11.2, the estimated coefficient on black is 0.084 in column (1), 0.688
in column (2), and 0.389 in column (3). In spite of these large differences, all
three models yield similar estimates of the marginal effect of race on the
probability of mortgage denial. How can this be?


11.3 What is maximum likelihood estimation? What are the advantages of using
maximum likelihood estimators such as the probit and the logit, instead of the linear
probability model? How would you choose between the probit and the logit?


11.4 What measures of fit are typically used to assess binary dependent variable
regression models?


Exercises


Exercises 11.1 through 11.5 are based on the following scenario: Seven hundred
income-earning individuals from a district were randomly selected and asked whether
they are government employees ($Gov_i = 1$) or not ($Gov_i = 0$); data were also
collected on their gender ($Male_i = 1$ if male and $= 0$ if female) and their years of
schooling ($Schooling_i$, in years). Note, Schooling refers to the number of years of
education received by people ages 25 and older. The following table summarizes
several estimated models.

Dependent Variable: Gov

                    Probit      Logit       Linear      Probit      Logit       Linear      Probit
                                            Probability                         Probability
                    (1)         (2)         (3)         (4)         (5)         (6)         (7)
Schooling           0.272       0.551       0.035                                           0.548
                    (0.029)     (0.062)     (0.003)                                         (0.091)
Male                                                    -0.242      -0.455      -0.050      4.352
                                                        (0.125)     (0.234)     (0.025)     (1.291)
Male × Schooling                                                                            -0.344
                                                                                            (0.096)
Constant            -4.107      -8.146      -0.172      -1.027      [illegible] 0.152       -7.702
                    (0.358)     (0.800)     (0.027)     (0.098)     (0.179)     (0.021)     (1.238)


11.1 Using the results in column (1):
a. Does the probability of working for the government depend on
Schooling? Explain.
b. Friedrich Fürnrohr has 16 years of schooling. What is the probability that
he will be employed by the government?
c. Hans Schneider never went to college (12 years of schooling). What is
the probability that Hans will get a government job?
d. The sample included values of Schooling between 0 and 18 years, and
only five people in the sample had more than 15 years of schooling.
Günter Mayer has completed his PhD and has been a student for
24 years. What is the model’s prediction for the probability that Günter
will be employed by the government? Do you think that this prediction
is reliable? Why or why not?

11.2 a. Answer (a) through (c) from Exercise 11.1 using the results in column (2).
b. Sketch the predicted probabilities from the probit and logit in columns (1)
and (2) for values of Schooling between 0 and 18. Are the probit and logit
models similar?

11.3 a. Answer (a) through (c) from Exercise 11.1 using the results in column (3).
b. Sketch the predicted probabilities from the probit and linear probability
in columns (1) and (3) as a function of Schooling for values of Schooling
between 0 and 18. Do you think that the linear probability is appropriate
here? Why or why not?

11.4 Using the results in columns (4) through (6):
a. Compute the estimated probability of being employed by the
government for men and for women.
b. Are the models in (4) through (6) different? Why or why not?

11.5 Using the results in column (7):
a. Liam Johansson is a man with 10 years of schooling. What is the
probability that government will employ him?
b. Anneli Karlsson is a woman with 12 years of schooling. What is the
probability that government will employ her?
c. Does the effect of schooling on government employment depend on
gender? Explain.

11.6 Use the estimated probit model in Equation (11.8) to answer the following
questions:
a. A black mortgage applicant has a P/I ratio of 0.35. What is the
probability that his application will be denied?
b. Suppose the applicant reduced this ratio to 0.30. What effect would this
have on his probability of being denied a mortgage?
c. Repeat (a) and (b) for a white applicant.
d. Does the marginal effect of the P/I ratio on the probability of mortgage
denial depend on race? Explain.

11.7 Repeat Exercise 11.6 using the logit model in Equation (11.10). Are the logit
and probit results similar? Explain.

11.8 Consider the linear probability model $Y_i = \beta_0 + \beta_1 X_i + u_i$ and assume that
$E(u_i | X_i) = 0$.
a. Show that $\Pr(Y_i = 1 | X_i) = \beta_0 + \beta_1 X_i$.
b. Show that $\mathrm{var}(u_i | X_i) = (\beta_0 + \beta_1 X_i)[1 - (\beta_0 + \beta_1 X_i)]$. [Hint: Review
Equation (2.7).]
c. Is $u_i$ heteroskedastic? Explain.
d. (Requires Section 11.3) Derive the likelihood function.


11.9 Use the estimated linear probability model shown in column (1) of Table 11.2 
to answer the following: 


a. Two applicants, one self-employed and one in salaried employment, 
apply for a mortgage. They have the same values for all the regres- 
sors other than employment status. How much more likely is the self- 
employed applicant to be denied a mortgage? 


b. Construct a 95% confidence interval for your answer to (a). 
c. Think of an important omitted variable that might bias the answer in (a). 
What is it, and how would it bias the results? 
11.10 (Requires Section 11.3 and calculus) Suppose a random variable $Y$ has the
following probability distribution: $\Pr(Y = 1) = p$, $\Pr(Y = 2) = q$, and
$\Pr(Y = 3) = 1 - p - q$. A random sample of size $n$ is drawn from this
distribution, and the random variables are denoted $Y_1, Y_2, \ldots, Y_n$.


a. Derive the likelihood function for the parameters p and q. 


b. Derive formulas for the MLE of p and q. 
11.11 (Requires Appendix 11.3) State which model you would use for: 


a. A study explaining the number of hours a person spends working in a 
factory during one week. 


b. A study explaining the level of satisfaction (0 through 5) a person gains 
from their job. 


c. A study of consumers’ choices for mode of transport — bus, car, or bicycle. 


d. A study of the number of rainy days in a week. 


Empirical Exercises 


E11.1 In April 2008, the unemployment rate in the United States stood at 5.0%. By
April 2009, it had increased to 9.0%, and it had increased further, to 10.0%,
by October 2009. Were some groups of workers more likely to lose their jobs
than others during the Great Recession? For example, were young workers
more likely to lose their jobs than middle-aged workers? What about workers
with a college degree versus those without a degree or women versus men?
On the text website, http://www.pearsonglobaleditions.com, you will find the
data file Employment_08_09, which contains a random sample of 5440 workers
who were surveyed in April 2008 and reported that they were employed
full-time. A detailed description is given in Employment_08_09_Description,
available on the website. These workers were surveyed one year later, in
April 2009, and asked about their employment status (employed, unemployed, or
out of the labor force). The data set also includes various demographic measures
for each individual. Use these data to answer the following questions.

a. What fraction of workers in the sample were employed in April 2009?
Use your answer to compute a 95% confidence interval for the probability
that a worker was employed in April 2009, conditional on being
employed in April 2008.

b. Regress Employed on Age and Age², using a linear probability model.

i. Based on this regression, was age a statistically significant determinant
of employment in April 2009?
ii. Is there evidence of a nonlinear effect of age on the probability of
being employed?
iii. Compute the predicted probability of employment for a 20-year-old
worker, a 40-year-old worker, and a 60-year-old worker.

c. Repeat (b) using a probit regression.

d. Repeat (b) using a logit regression.

e. Are there important differences in your answers to (b)-(d)? Explain.

f. The data set includes variables measuring the workers’ educational
attainment, sex, race, marital status, region of the country, and weekly
earnings in April 2008.

i. Construct a table like Table 11.2 to investigate whether the conclusions
on the effect of age on employment from (b)-(d) are affected
by omitted variable bias.
ii. Use the regressions in your table to discuss the characteristics of
workers who were hurt most by the Great Recession.

g. The results in (a)-(f) were based on the probability of employment.
Workers who are not employed can either be (i) unemployed or
(ii) out of the labor force. Do the conclusions you reached in (a)-(f) also
hold for workers who became unemployed? (Hint: Use the binary
variable Unemployed instead of Employed.)

h. These results have covered employment transitions during the Great
Recession, but what about transitions during normal times? On the text
website, you will find the data file Employment_06_07, which measures
the same variables but for the years 2006-2007. Analyze these data and
comment on the differences in employment transitions during recessions
and normal times.


E11.2 Believe it or not, workers used to be able to smoke inside office buildings.
Smoking bans were introduced in several areas during the 1990s. Supporters of
these bans argued that in addition to eliminating the externality of secondhand
smoke, they would encourage smokers to quit by reducing their opportunities
to smoke. In this assignment, you will estimate the effect of workplace smoking
bans on smoking, using data on a sample of 10,000 U.S. indoor workers from
1991 to 1993, available on the text website, http://www.pearsonglobaleditions.com,
in the file Smoking. The data set contains information on whether individuals
were or were not subject to a workplace smoking ban, whether the
individuals smoked, and other individual characteristics.¹ A detailed description
is given in Smoking_Description, available on the website.

a. Estimate the probability of smoking for (i) all workers, (ii) workers
affected by workplace smoking bans, and (iii) workers not affected by
workplace smoking bans.

b. What is the difference in the probability of smoking between workers
affected by a workplace smoking ban and workers not affected by a
workplace smoking ban? Use a linear probability model to determine
whether this difference is statistically significant.

c. Estimate a linear probability model with smoker as the dependent
variable and the following regressors: smkban, female, age, age²,
hsdrop, hsgrad, colsome, colgrad, black, and hispanic. Compare the
estimated effect of a smoking ban from this regression with your answer
from (b). Suggest an explanation, based on the substance of this regression,
for the change in the estimated effect of a smoking ban between (b) and (c).

d. Test the hypothesis that the coefficient on smkban is 0 in the population
version of the regression in (c) against the alternative that it is nonzero,
at the 5% significance level.

e. Test the hypothesis that the probability of smoking does not depend on
the level of education in the regression in (c). Does the probability of
smoking increase or decrease with the level of education?

f. Repeat (c)-(e) using a probit model.

g. Repeat (c)-(e) using a logit model.

h. i. Mr. A is white, non-Hispanic, 20 years old, and a high school dropout.
Using the probit regression and assuming that Mr. A is not subject
to a workplace smoking ban, calculate the probability that Mr. A
smokes. Carry out the calculation again, assuming that he is subject
to a workplace smoking ban. What is the effect of the smoking ban
on the probability of smoking?
ii. Repeat (i) for Ms. B, a female, black, 40-year-old college graduate.
iii. Repeat (i)-(ii) using the linear probability model.
iv. Repeat (i)-(ii) using the logit model.
v. Based on your answers to (i)-(iv), do the logit, probit, and linear
probability models differ? If they do, which results make most sense?
Are the estimated effects large in a real-world sense?

¹ These data were provided by Professor William Evans of the University of Maryland and were used
in his paper with Matthew Farrelly and Edward Montgomery, “Do Workplace Smoking Bans Reduce
Smoking?” American Economic Review, 1999, 89(4): 728-747.


The Boston HMDA Data Set 


The Boston HMDA data set was collected by researchers at the Federal Reserve Bank of 
Boston. The data set combines information from mortgage applications and a follow-up survey 
of the banks and other lending institutions that received these mortgage applications. The data 
pertain to mortgage applications made in 1990 in the greater Boston metropolitan area. The 
full data set has 2925 observations, consisting of all mortgage applications by blacks and His- 
panics plus a random sample of mortgage applications by whites. 

To narrow the scope of the analysis in this chapter, we use a subset of the data for single- 
family residences only (thereby excluding data on multifamily homes) and for black applicants 
and white applicants only (thereby excluding data on applicants from other minority groups). This 
leaves 2380 observations. Definitions of the variables used in this chapter are given in Table 11.1. 

These data were graciously provided to us by Geoffrey Tootell of the Research Depart- 
ment of the Federal Reserve Bank of Boston. More information about this data set, along with 
the conclusions reached by the Federal Reserve Bank of Boston researchers, is available in 
Munnell et al. (1996). 


Maximum Likelihood Estimation 


This appendix provides a brief introduction to maximum likelihood estimation in the context 
of the binary response models discussed in this chapter. We start by deriving the MLE of the 
success probability p for n i.i.d. observations of a Bernoulli random variable. We then turn to 
the probit and logit models and discuss the pseudo-$R^2$. We conclude with a discussion of
standard errors for predicted probabilities. This appendix uses calculus at two points.


MLE for n i.i.d. Bernoulli Random Variables 


The first step in computing the MLE is to derive the joint probability distribution. For $n$ i.i.d.
observations on a Bernoulli random variable, this joint probability distribution is the extension
of the $n = 2$ case in Section 11.3 to general $n$:

$$\begin{aligned}
\Pr(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n)
&= [p^{y_1}(1 - p)^{(1 - y_1)}] \times [p^{y_2}(1 - p)^{(1 - y_2)}] \times \cdots \times [p^{y_n}(1 - p)^{(1 - y_n)}] \\
&= p^{(y_1 + \cdots + y_n)}(1 - p)^{n - (y_1 + \cdots + y_n)}.
\end{aligned} \tag{11.13}$$

The likelihood function is the joint probability distribution, treated as a function of the unknown
coefficients. Let $S = \sum_{i=1}^{n} Y_i$; then the likelihood function is

$$f_{Bernoulli}(p; Y_1, \ldots, Y_n) = p^{S}(1 - p)^{n - S}. \tag{11.14}$$


The MLE of p is the value of p that maximizes the likelihood in Equation (11.14). The likelihood 
function can be maximized using calculus. It is convenient to maximize not the likelihood but 
rather its logarithm (because the logarithm is a strictly increasing function, maximizing the 
likelihood or its logarithm gives the same estimator). The log likelihood is 


$S\ln(p) + (n - S)\ln(1 - p)$, and the derivative of the log likelihood with respect to $p$ is

$$\frac{d}{dp}\ln[f_{Bernoulli}(p; Y_1, \ldots, Y_n)] = \frac{S}{p} - \frac{n - S}{1 - p}. \tag{11.15}$$

Setting the derivative in Equation (11.15) to 0 and solving for $p$ yields the MLE $\hat{p} = S/n = \bar{Y}$.
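
The closed-form result can be checked numerically. The following is a minimal sketch, not part of the original text: the ten-observation sample `ys` is made up for illustration, and the grid search simply evaluates the log likelihood $S\ln(p) + (n - S)\ln(1 - p)$ at many values of $p$ to confirm that the maximum sits at the sample mean.

```python
import math


def bernoulli_log_likelihood(p, ys):
    """Log likelihood S*ln(p) + (n - S)*ln(1 - p) for 0/1 observations ys."""
    n = len(ys)
    s = sum(ys)
    return s * math.log(p) + (n - s) * math.log(1.0 - p)


# Hypothetical sample of n = 10 Bernoulli draws (illustrative only).
ys = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

# Closed-form MLE: the sample mean S/n.
p_hat = sum(ys) / len(ys)

# Crude numerical check: evaluate the log likelihood on a grid of p values
# strictly inside (0, 1) and pick the maximizer.
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, ys))

print("closed-form MLE:", p_hat)
print("grid maximizer: ", p_grid)
```

Because the log likelihood is strictly concave in $p$ when $0 < S < n$, the grid maximizer should agree with $S/n$ up to the grid spacing.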


MLE for the Probit Model 


For the probit model, the probability that $Y_i = 1$, conditional on $X_{1i}, \ldots, X_{ki}$, is
$p_i = \Phi(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki})$. The conditional probability distribution for the $i$th
observation is $\Pr[Y_i = y_i \mid X_{1i}, \ldots, X_{ki}] = p_i^{y_i}(1 - p_i)^{1 - y_i}$. Assuming that $(X_{1i}, \ldots, X_{ki}, Y_i)$ are i.i.d.,
$i = 1, \ldots, n$, the joint probability distribution of $Y_1, \ldots, Y_n$, conditional on the $X$’s, is

$$\begin{aligned}
\Pr(Y_1 = y_1, \ldots, Y_n = y_n \mid X_{1i}, \ldots, X_{ki}, i = 1, \ldots, n)
&= \Pr(Y_1 = y_1 \mid X_{11}, \ldots, X_{k1}) \times \cdots \times \Pr(Y_n = y_n \mid X_{1n}, \ldots, X_{kn}) \\
&= p_1^{y_1}(1 - p_1)^{1 - y_1} \times \cdots \times p_n^{y_n}(1 - p_n)^{1 - y_n}.
\end{aligned} \tag{11.16}$$

The likelihood function is the joint probability distribution, treated as a function of the
unknown coefficients. It is conventional to consider the logarithm of the likelihood.
Accordingly, the log likelihood function is

$$\begin{aligned}
\ln[f_{probit}(\beta_0, \ldots, \beta_k; Y_1, \ldots, Y_n \mid X_{1i}, \ldots, X_{ki}, i = 1, \ldots, n)]
&= \sum_{i=1}^{n} Y_i \ln[\Phi(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki})] \\
&\quad + \sum_{i=1}^{n} (1 - Y_i)\ln[1 - \Phi(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki})],
\end{aligned} \tag{11.17}$$

where this expression incorporates the probit formula for the conditional probability,
$p_i = \Phi(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki})$.

The MLE for the probit model maximizes the likelihood function or, equivalently, the
logarithm of the likelihood function given in Equation (11.17). Because there is no simple
formula for the MLE, the probit likelihood function must be maximized using a numerical
algorithm on the computer.

Under general conditions, maximum likelihood estimators are consistent and have a normal
sampling distribution in large samples.
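
To make "maximized using a numerical algorithm" concrete, here is a minimal sketch for the single-regressor case of Equation (11.17). It is an illustration only: the eight data points are invented, and a brute-force grid search stands in for the Newton-type optimizers that statistical software would normally use. The function names are hypothetical helpers, not part of any library.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_log_likelihood(b0, b1, xs, ys):
    """Probit log likelihood, Equation (11.17), with a single regressor."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = std_normal_cdf(b0 + b1 * x)
        # Keep p away from 0 and 1 so that log() is always defined.
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical data (illustrative only); the outcomes are not perfectly
# separated by x, so a finite maximizer exists.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

# Brute-force search over a coarse grid of candidate coefficients.
b0_grid = [-3.0 + 0.05 * i for i in range(121)]
b1_grid = [-2.0 + 0.05 * j for j in range(81)]
best_ll, b0_hat, b1_hat = max(
    (probit_log_likelihood(b0, b1, xs, ys), b0, b1)
    for b0 in b0_grid
    for b1 in b1_grid
)

print("approximate MLE: b0 =", round(b0_hat, 2), " b1 =", round(b1_hat, 2))
print("maximized log likelihood:", round(best_ll, 3))
```

A grid search is transparent but scales poorly with the number of coefficients, which is one reason practical implementations rely on gradient-based methods instead.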
MLE for the Logit Model


The likelihood for the logit model is derived in the same way as the likelihood for the probit
model. The only difference is that the conditional success probability $p_i$ for the logit model is
given by Equation (11.9). Accordingly, the log likelihood of the logit model is given by
Equation (11.17), with $\Phi(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki})$ replaced by $[1 + e^{-(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki})}]^{-1}$.
As with the probit model, there is no simple formula for the MLE of the logit coefficients, so
the log likelihood must be maximized numerically.


Pseudo-$R^2$


The pseudo-$R^2$ compares the value of the likelihood of the estimated model to the value of the
likelihood when none of the $X$’s are included as regressors. Specifically, the pseudo-$R^2$ for the
probit model is

$$\text{pseudo-}R^2 = 1 - \frac{\ln(f^{max}_{probit})}{\ln(f^{max}_{Bernoulli})}, \tag{11.18}$$

where $f^{max}_{probit}$ is the value of the maximized probit likelihood (which includes the $X$’s) and $f^{max}_{Bernoulli}$
is the value of the maximized Bernoulli likelihood (the probit model excluding all the $X$’s).
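
As a small worked illustration of Equation (11.18), the sketch below computes the maximized Bernoulli log likelihood from a sample of 0/1 outcomes and combines it with a maximized probit log likelihood. Both the outcomes and the value `loglik_probit` are hypothetical numbers chosen for the example; in practice the latter would come from a fitted probit model.

```python
import math


def bernoulli_max_log_likelihood(ys):
    """Maximized Bernoulli log likelihood, evaluated at p_hat = S/n.

    Assumes the sample contains both 0s and 1s, so that 0 < p_hat < 1.
    """
    n = len(ys)
    s = sum(ys)
    p_hat = s / n
    return s * math.log(p_hat) + (n - s) * math.log(1.0 - p_hat)


def pseudo_r2(loglik_model, loglik_null):
    """Pseudo-R^2 = 1 - ln(f_model) / ln(f_null), as in Equation (11.18)."""
    return 1.0 - loglik_model / loglik_null


# Hypothetical outcomes and a hypothetical maximized probit log likelihood.
ys = [0, 0, 1, 0, 1, 0, 1, 1]
loglik_null = bernoulli_max_log_likelihood(ys)
loglik_probit = -4.2  # assumed value, for illustration only

print("null log likelihood:", round(loglik_null, 3))
print("pseudo-R^2:", round(pseudo_r2(loglik_probit, loglik_null), 3))
```

Because both log likelihoods are negative and the model with regressors fits at least as well as the model without them, the ratio lies between 0 and 1, and so does the pseudo-$R^2$.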


Standard Errors for Predicted Probabilities 


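For simplicity, consider the case of a single regressor in the probit model. Then the predicted
probability at a fixed value of that regressor, $x$, is $\hat{p}(x) = \Phi(\hat{\beta}_0^{MLE} + \hat{\beta}_1^{MLE}x)$, where $\hat{\beta}_0^{MLE}$ and
$\hat{\beta}_1^{MLE}$ are the MLEs of the two probit coefficients. Because this predicted probability depends
on the estimators $\hat{\beta}_0^{MLE}$ and $\hat{\beta}_1^{MLE}$, and because those estimators have a sampling distribution,
the predicted probability will also have a sampling distribution.

The variance of the sampling distribution of $\hat{p}(x)$ is calculated by approximating the
function $\Phi(\hat{\beta}_0^{MLE} + \hat{\beta}_1^{MLE}x)$, a nonlinear function of $\hat{\beta}_0^{MLE}$ and $\hat{\beta}_1^{MLE}$, by a linear function of
$\hat{\beta}_0^{MLE}$ and $\hat{\beta}_1^{MLE}$. Specifically, let

$$\hat{p}(x) = \Phi(\hat{\beta}_0^{MLE} + \hat{\beta}_1^{MLE}x) \cong c + a_0(\hat{\beta}_0^{MLE} - \beta_0) + a_1(\hat{\beta}_1^{MLE} - \beta_1), \tag{11.19}$$

where the constant $c$ and factors $a_0$ and $a_1$ depend on $x$ and are obtained from calculus.
[Equation (11.19) is a first-order Taylor series expansion; $c = \Phi(\beta_0 + \beta_1 x)$; and $a_0$ and $a_1$ are
the partial derivatives, $a_0 = \partial\Phi(\beta_0 + \beta_1 x)/\partial\beta_0$ and $a_1 = \partial\Phi(\beta_0 + \beta_1 x)/\partial\beta_1$, each evaluated at
$\hat{\beta}_0^{MLE}, \hat{\beta}_1^{MLE}$.] The variance of $\hat{p}(x)$ now can be calculated using the approximation in Equation (11.19) and
the expression for the variance of the sum of two random variables in Equation (2.32):

$$\begin{aligned}
\mathrm{var}[\hat{p}(x)] &\cong \mathrm{var}[c + a_0(\hat{\beta}_0^{MLE} - \beta_0) + a_1(\hat{\beta}_1^{MLE} - \beta_1)] \\
&= a_0^2\,\mathrm{var}(\hat{\beta}_0^{MLE}) + a_1^2\,\mathrm{var}(\hat{\beta}_1^{MLE}) + 2a_0a_1\,\mathrm{cov}(\hat{\beta}_0^{MLE}, \hat{\beta}_1^{MLE}).
\end{aligned} \tag{11.20}$$

Using Equation (11.20), the standard error of $\hat{p}(x)$ can be calculated using estimates of the
variances and covariance of the MLEs.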
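
The calculation in Equations (11.19) and (11.20) can be sketched in a few lines. For the probit model the partial derivatives work out to $a_0 = \phi(\beta_0 + \beta_1 x)$ and $a_1 = \phi(\beta_0 + \beta_1 x)\,x$, where $\phi$ is the standard normal density. All numerical inputs below (coefficients, variances, covariance, and the value of $x$) are hypothetical and chosen only to illustrate the arithmetic; the function name is likewise an assumption of this sketch.

```python
import math


def std_normal_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def predicted_prob_and_se(b0, b1, x, var_b0, var_b1, cov_b0_b1):
    """Predicted probit probability at x and its approximate standard error.

    Uses the first-order (delta method) approximation of Equation (11.20).
    """
    z = b0 + b1 * x
    a0 = std_normal_pdf(z)       # derivative of Phi(b0 + b1*x) w.r.t. b0
    a1 = std_normal_pdf(z) * x   # derivative of Phi(b0 + b1*x) w.r.t. b1
    variance = a0**2 * var_b0 + a1**2 * var_b1 + 2.0 * a0 * a1 * cov_b0_b1
    return std_normal_cdf(z), math.sqrt(variance)


# Hypothetical estimates and covariance terms (illustrative only).
p_hat, se_p_hat = predicted_prob_and_se(
    b0=-2.0, b1=3.0, x=0.3, var_b0=0.04, var_b1=0.25, cov_b0_b1=-0.08
)

print("predicted probability:", round(p_hat, 4))
print("approximate standard error:", round(se_p_hat, 4))
```

The covariance term matters: intercept and slope estimates are often negatively correlated, and ignoring that correlation would overstate the variance in this example.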
APPENDIX 11.3 Other Limited Dependent Variable Models


This appendix surveys some models for limited dependent variables, other than binary 
variables, found in econometric applications. In most cases, the OLS estimators of the 
parameters of limited dependent variable models are inconsistent, and estimation is rou- 
tinely done using maximum likelihood. There are several advanced references available to 
the reader interested in further details; see, for example, Greene (2018), Ruud (2000), and 
Wooldridge (2010). 


Censored and Truncated Regression Models 


Suppose you have cross-sectional data on car purchases by individuals in a given year. Car 
buyers have positive expenditures, which can reasonably be treated as continuous random 
variables, but nonbuyers spend $0. Thus the distribution of car expenditures is a combination 
of a discrete distribution (at 0) and a continuous distribution. 

Nobel laureate James Tobin developed a useful model for a dependent variable with a
partly continuous and partly discrete distribution (Tobin, 1958). Tobin suggested modeling the
$i$th individual in the sample as having a desired level of spending, $Y_i^*$, that is related to the
regressors (for example, family size) according to a linear regression model. That is, when there
is a single regressor, the desired level of spending is

$$Y_i^* = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \ldots, n. \tag{11.21}$$

If $Y_i^*$ (what the consumer wants to spend) exceeds some cutoff, such as the minimum price of
a car, the consumer buys the car and spends $Y_i = Y_i^*$, which is observed. However, if $Y_i^*$ is less
than the cutoff, spending of $Y_i = 0$ is observed instead of $Y_i^*$.

When Equation (11.21) is estimated using observed expenditures $Y_i$ in place of $Y_i^*$, the
OLS estimator is inconsistent. Tobin solved this problem by deriving the likelihood function
using the additional assumption that $u_i$ has a normal distribution, and the resulting
MLE has been used by applied econometricians to analyze many problems in economics.
In Tobin’s honor, Equation (11.21), combined with the assumption of normal errors, is
called the tobit regression model. The tobit model is an example of a censored regression
model, so called because the dependent variable has been “censored” above or below a
certain cutoff.
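
The inconsistency of OLS under censoring can be seen in a small simulation. The sketch below is an illustration under assumed values, not an implementation of the tobit MLE: the coefficients, the normal distributions for $X_i$ and $u_i$, the sample size, and the cutoff of 0 are all choices made for the example. It generates latent spending $Y_i^*$ from Equation (11.21), censors it at the cutoff, and compares the OLS slope computed from $Y_i^*$ with the one computed from observed $Y_i$.

```python
import random


def ols_slope(xs, ys):
    """OLS slope from a regression of ys on xs (with an intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


random.seed(0)

# Hypothetical population values (assumptions for this illustration).
beta0, beta1 = -1.0, 1.0
n = 20000

xs = [random.gauss(0.0, 1.0) for _ in range(n)]
# Latent (desired) spending Y*, as in Equation (11.21), with normal errors.
y_star = [beta0 + beta1 * x + random.gauss(0.0, 1.0) for x in xs]
# Observed spending: Y = Y* above the cutoff (here 0), and 0 otherwise.
y_obs = [y if y > 0.0 else 0.0 for y in y_star]

slope_latent = ols_slope(xs, y_star)
slope_censored = ols_slope(xs, y_obs)

print("OLS slope using latent Y*:  ", round(slope_latent, 3))
print("OLS slope using censored Y: ", round(slope_censored, 3))
```

With these assumed values most observations are censored, and the slope from the observed data falls well short of the true coefficient even though the sample is large, which is the sense in which OLS is inconsistent here.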


Sample Selection Models 


In the censored regression model, there are data on buyers and nonbuyers, as there would be
if the data were obtained via simple random sampling of the adult population. If, however, the
data are collected from sales tax records, then the data would include only buyers: There would
be no data at all for nonbuyers. Data in which observations are unavailable above or below a
threshold (data for buyers only) are called truncated data. The truncated regression model is a
regression model applied to data in which observations are simply unavailable when the
dependent variable is above or below a certain cutoff.

The truncated regression model is an example of a sample selection model, in which the
selection mechanism (an individual is in the sample by virtue of buying a car) is related to the
value of the dependent variable (expenditure on a car). As discussed in the box “James Heckman
and Daniel McFadden, Nobel Laureates” in Section 11.5, one approach to estimation of
sample selection models is to develop two equations, one for $Y_i^*$ and one for whether $Y_i^*$ is
observed. The parameters of the model can then be estimated by maximum likelihood, or, in
a stepwise procedure, estimating the selection equation first and then estimating the equation
for $Y_i^*$. For additional discussion, see Ruud (2000, Chapter 28), Greene (2018, Chapter 19), or
Wooldridge (2010, Chapter 17).


Count Data 


Count data arise when the dependent variable is a counting number—for example, the num- 
ber of restaurant meals eaten by a consumer in a week. When these numbers are large, the 
variable can be treated as approximately continuous, but when they are small, the continuous 
approximation is a poor one. The linear regression model, estimated by OLS, can be used for 
count data, even if the number of counts is small. Predicted values from the regression are 
interpreted as the expected value of the dependent variable, conditional on the regressors. So 
when the dependent variable is the number of restaurant meals eaten, a predicted value of 
1.7 means, on average, 1.7 restaurant meals per week. As in the binary regression model, 
however, OLS does not take advantage of the special structure of count data and can yield 
nonsense predictions: for example, $-0.2$ restaurant meals per week. Just as probit and logit
eliminate nonsense predictions when the dependent variable is binary, special models do so
for count data. The two most widely used models are the Poisson and negative binomial
regression models.


Ordered Responses 


Ordered response data arise when mutually exclusive qualitative categories have a natural 
ordering, such as obtaining a high school diploma, obtaining some college education (but not 
graduating), or graduating from college. Like count data, ordered response data have a natural 
ordering, but unlike count data, they do not have natural numerical values. 

Because there are no natural numerical values for ordered response data, OLS is inap- 
propriate. Instead, ordered data are often analyzed using a generalization of probit called the 
ordered probit model, in which the probability of each outcome (e.g., a college education), 
conditional on the independent variables (such as parents’ income), is modeled using the
cumulative normal distribution.
Discrete Choice Data


A discrete choice or multiple choice variable can take on multiple unordered qualitative values. 
One example in economics is the mode of transport chosen by a commuter: She might take 
the subway, ride the bus, drive, or make her way under her own power (walk, bicycle). If we 
were to analyze these choices, the dependent variable would have four possible outcomes 
(subway, bus, car, and human-powered). These outcomes are not ordered in any natural way. 
Instead, the outcomes are a choice among distinct qualitative alternatives. 

The econometric task is to model the probability of choosing the various options given 
various regressors such as individual characteristics (how far the commuter’s house is from 
the subway station) and the characteristics of each option (the price of the subway). As 
discussed in the box in Section 11.5, models for analysis of discrete choice data can be devel- 
oped from principles of utility maximization. Individual choice probabilities can be expressed 
in probit or logit form, and those models are called multinomial probit and multinomial logit
regression models.


CHAPTER 12 Instrumental Variables Regression


Chapter 9 discussed several problems, including omitted variables, errors in
variables, and simultaneous causality, that make the error term correlated with
the regressor. Omitted variable bias can be addressed directly by including the 
omitted variable in a multiple regression, but this is only feasible if you have data on 
the omitted variable. And sometimes, such as when causality runs both from X to Y 
and from Y to X so that there is simultaneous causality bias, multiple regression simply 
cannot eliminate the bias. If a direct solution to these problems is either infeasible or 
unavailable, a new method is required. 

Instrumental variables (IV) regression is a general way to obtain a 
consistent estimator of the unknown causal coefficients when the regressor, X, is 
correlated with the error term, u. To understand how IV regression works, think 
of the variation in X as having two parts: one part that, for whatever reason, is 
correlated with u (this is the part that causes the problems) and a second part 
that is uncorrelated with u. If you had information that allowed you to isolate 
the second part, you could focus on those variations in X that are uncorrelated 
with u and disregard the variations in X that bias the OLS estimates. This is, in 
fact, what IV regression does. The information about the movements in X that 
are uncorrelated with u is gleaned from one or more additional variables, called 
instrumental variables or simply instruments. Instrumental variables regression 
uses these additional variables as tools or “instruments” to isolate the movements 
in X that are uncorrelated with u, which in turn permits consistent estimation of 
the regression coefficients. 

The first two sections of this chapter describe the mechanics and assumptions 
of IV regression: why IV regression works, what is a valid instrument, and how to 
implement and to interpret the most common IV regression method, two stage 
least squares. The key to successful empirical analysis using instrumental 
variables is finding valid instruments, and Section 12.3 takes up the question of 
how to assess whether a set of instruments is valid. As an illustration, Section 12.4 
uses IV regression to estimate the elasticity of demand for cigarettes. Finally, 
Section 12.5 turns to the difficult question of where valid instruments come from 
in the first place. 
12.1 The IV Estimator with a Single Regressor and a Single Instrument


We start with the case of a single regressor, X, which might be correlated with the 
error, u. If X and u are correlated, the OLS estimator is inconsistent; that is, it may 
not be close to the true value of the causal coefficient even when the sample is very 
large [see Equation (6.1)]. As discussed in Section 9.2, this correlation between X and 
u can stem from various sources, including omitted variables, errors in variables 
(measurement errors in the regressors), and simultaneous causality (when causality 
runs “backward” from Y to X as well as “forward” from X to Y). Whatever the source 
of the correlation between X and u, if there is a valid instrumental variable, Z, the 
effect on Y of a unit change in X can be estimated using the instrumental variables 
estimator. 


The IV Model and Assumptions 


Let $\beta_1$ be the causal effect of $X$ on $Y$. The model relating the dependent variable $Y_i$
and regressor $X_i$, without any control variables, is

$$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \ldots, n, \tag{12.1}$$

where $u_i$ is the error term representing omitted factors that determine $Y_i$. If $X_i$ and
$u_i$ are correlated, the OLS estimator is inconsistent. Instrumental variables estimation
uses an additional, “instrumental” variable $Z$ to isolate that part of $X$ that is
uncorrelated with $u$.


Endogeneity and exogeneity. Instrumental variables regression has some special- 
ized terminology to distinguish variables that are correlated with the population 
error term u from ones that are not. Variables correlated with the error term are 
called endogenous variables, while variables uncorrelated with the error term are 
called exogenous variables. The historical source of these terms traces to models with 
multiple equations, in which an “endogenous” variable is determined within the 
model, while an “exogenous” variable is determined outside the model. For example, 
Section 9.2 considered the possibility that if low test scores produced decreases in the 
student-teacher ratio because of political intervention and increased funding, causal- 
ity would run both from the student-teacher ratio to test scores and from test scores 
to the student-teacher ratio. This was represented mathematically as a system of two 
simultaneous equations [Equations (9.3) and (9.4)], one for each causal connection. 
As discussed in Section 9.2, because both test scores and the student-teacher ratio 
are determined within the model, both are correlated with the population error term 
u; that is, in this example, both variables are endogenous. In contrast, an exogenous 
variable, which is determined outside the model, is uncorrelated with u. 
The two conditions for a valid instrument. A valid instrumental variable (“instrument”)
$Z$ must satisfy two conditions, known as the instrument relevance condition
and the instrument exogeneity condition:


1. Instrument relevance: $\mathrm{corr}(Z_i, X_i) \neq 0$.
2. Instrument exogeneity: $\mathrm{corr}(Z_i, u_i) = 0$.


If an instrument is relevant, then variation in the instrument is related to variation
in $X_i$. If in addition the instrument is exogenous, then that part of the variation
of $X_i$ captured by the instrumental variable is exogenous. Thus an instrument that is
relevant and exogenous can capture movements in $X_i$ that are exogenous. This
exogenous variation can in turn be used to estimate the population coefficient $\beta_1$.

The two conditions for a valid instrument are vital for instrumental variables 
regression, and we return to them (and their extension to multiple regressors and 
multiple instruments) repeatedly throughout this chapter. 


The Two Stage Least Squares Estimator 


If the instrument $Z$ satisfies the conditions of instrument relevance and exogeneity, the
coefficient $\beta_1$ can be estimated using an IV estimator called two stage least squares (TSLS).
As the name suggests, the two stage least squares estimator is calculated in two stages. The
first stage decomposes $X$ into two components: a problematic component that may be
correlated with the regression error and another, problem-free component that is uncorrelated
with the error. The second stage uses the problem-free component to estimate $\beta_1$.

The first stage begins with a population regression linking $X$ and $Z$:

$$X_i = \pi_0 + \pi_1 Z_i + v_i, \tag{12.2}$$

where $\pi_0$ is the intercept, $\pi_1$ is the slope, and $v_i$ is the error term. This regression provides
the needed decomposition of $X_i$. One component is $\pi_0 + \pi_1 Z_i$, the part of $X_i$
that can be predicted by $Z_i$. Because $Z_i$ is exogenous, this component of $X_i$ is uncorrelated
with $u_i$, the error term in Equation (12.1). The other component of $X_i$ is $v_i$,
which is the problematic component of $X_i$ that is correlated with $u_i$.

The idea behind TSLS is to use the problem-free component of $X_i$, $\pi_0 + \pi_1 Z_i$,
and to disregard $v_i$. The only complication is that the values of $\pi_0$ and $\pi_1$ are unknown,
so $\pi_0 + \pi_1 Z_i$ cannot be calculated. Accordingly, the first stage of TSLS applies OLS
to Equation (12.2) and uses the predicted value from the OLS regression,
$\hat{X}_i = \hat{\pi}_0 + \hat{\pi}_1 Z_i$, where $\hat{\pi}_0$ and $\hat{\pi}_1$ are the OLS estimates.

The second stage of TSLS is easy: Regress $Y_i$ on $\hat{X}_i$ using OLS. The resulting
estimators from the second-stage regression are the TSLS estimators, $\hat{\beta}_0^{TSLS}$ and $\hat{\beta}_1^{TSLS}$.
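
The two stages can be sketched with simulated data. Everything in the example below is an assumption made for illustration: the coefficient values, the normal distributions, and the way the first-stage error $v_i$ is built to be correlated with $u_i$ (which makes $X_i$ endogenous) while $Z_i$ is drawn independently of $u_i$ (which makes it a valid instrument by construction). The helper `ols` is a hypothetical function written for this sketch.

```python
import random


def ols(xs, ys):
    """Return (intercept, slope) from an OLS regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


random.seed(1)

# Hypothetical population values (assumptions for this illustration).
beta0, beta1 = 1.0, 2.0   # Equation (12.1)
pi0, pi1 = 0.5, 1.0       # Equation (12.2)
n = 20000

zs, xs, ys = [], [], []
for _ in range(n):
    z = random.gauss(0.0, 1.0)            # instrument, independent of u
    u = random.gauss(0.0, 1.0)            # error term in (12.1)
    v = 0.8 * u + random.gauss(0.0, 1.0)  # correlated with u: X is endogenous
    x = pi0 + pi1 * z + v
    y = beta0 + beta1 * x + u
    zs.append(z)
    xs.append(x)
    ys.append(y)

# First stage: regress X on Z and form the predicted values X_hat.
pi0_hat, pi1_hat = ols(zs, xs)
x_hat = [pi0_hat + pi1_hat * z for z in zs]

# Second stage: regress Y on X_hat.
b0_tsls, b1_tsls = ols(x_hat, ys)

# For comparison: OLS of Y on X, which ignores the endogeneity.
b0_ols, b1_ols = ols(xs, ys)

print("true beta1:", beta1)
print("TSLS slope:", round(b1_tsls, 3))
print("OLS slope: ", round(b1_ols, 3))
```

In this design the OLS slope settles above the true coefficient, while the TSLS slope is close to it. One caution about computing TSLS by hand in this way: the standard errors reported by an ordinary second-stage OLS regression do not account for the fact that $\hat{X}_i$ is itself estimated, so dedicated TSLS routines are generally preferred for inference.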


Why Does IV Regression Work? 


Two examples provide some insight into why IV regression solves the problem of 
correlation between X; and u;. 
When Was Instrumental Variables Regression Invented?


Instrumental variables regression was first proposed
as a solution to the simultaneous causation
problem in econometrics in the appendix to
Philip G. Wright’s 1928 book, The Tariff on Animal
and Vegetable Oils. If you want to know how animal
and vegetable oils were produced, transported,
and sold in the early twentieth century, the first 285
pages of the book are for you. Econometricians,
however, will be more interested in Appendix B. The
appendix provides two derivations of “the method
of introducing external factors” (what we now call
the instrumental variables estimator) and uses
IV regression to estimate the supply and demand
elasticities for butter and flaxseed oil. Philip was an
obscure economist with a scant intellectual legacy 
other than this appendix, but his son Sewall went on 
to become a preeminent population geneticist and 
statistician. The invention of IV regression has been 
found to have been a joint intellectual collaboration 
between father and son. To learn more, see Stock 
and Trebbi (2003). 

The use of principles similar to those employed 
in IV regression can be traced further back in time 
to the identification of the origins of an outbreak 
of cholera in mid-nineteenth century London by 
John Snow. Snow wanted to investigate whether 
or not cholera was water-borne, but knew that an 
association between exposure to impure water and 
prevalence of cholera would not provide defini- 
tive proof of causality as the households exposed 
to impure water were often also exposed to a wide 
range of other environmental factors that could 
also have been behind the outbreak of the disease 
(omitted variables). Snow identified that two main 
water companies supplying water to households 
in London drew their water supply from different
parts of the River Thames, and used this information
to get around the issue. Lambeth Water Company
drew their water from above the sewage discharge, 
while Southwark and Vauxhall Company drew their 
water from below the discharge (thereby drawing 
water with greater impurity). Snow argued that the 
households served by the two companies were simi- 
lar except for the purity of the water that they were 
provided. He used this information to employ an 
approach analogous to IV regression with exposure 
to impure water as the endogenous variable and the 
supplying water company as the instrumental vari- 
able. The supplying water company could be consid- 
ered to be relevant to the exposure to impure water 
because of where water was drawn from in the river 
relative to the sewage discharge. The instrument was 
considered exogenous since there was no plausible 
direct effect on cholera and it was uncorrelated 
with other household factors that may have caused 
cholera. John Snow’s work identifying exposure to 
impure water as one of the causes of the outbreak 
of cholera led to him being regarded as one of the 
fathers of epidemiology. This example is explained 
in greater detail in Deaton (1997), Grootendorst 
(2007), Greene (2003), and, of course, Snow (1855). 


Courtesy of Rosalind W. Harris 

Example 1: Philip Wright's problem. The method of instrumental variables estimation
was first published in 1928 in an appendix to a book written by Philip G. Wright
(1928), although the key ideas of IV regression were developed collaboratively with 
his son Sewall Wright (see the box “When Was Instrumental Variables Regression 
Invented?”). Philip Wright was concerned with an important economic problem of 
his day: how to set an import tariff (a tax on imported goods) on animal and vegeta- 
ble oils and fats, such as butter and soy oil. In the 1920s, import tariffs were a major 
source of tax revenue for the United States. The key to understanding the economic 
effect of a tariff was having quantitative estimates of the demand and supply curves 
of the goods. Recall that the supply elasticity is the percentage change in the quantity 
supplied arising from a 1% increase in the price and that the demand elasticity is the 
percentage change in the quantity demanded arising from a 1% increase in the price. 
Philip Wright needed estimates of these elasticities of supply and demand. 

To be concrete, consider the problem of estimating the elasticity of demand for
butter. Recall from Key Concept 8.2 that the coefficient in a linear equation relating
$\ln(Y_i)$ to $\ln(X_i)$ has the interpretation of the elasticity of Y with respect to X. In
Wright’s problem, this suggests the demand equation

$$\ln(Q_i^{butter}) = \beta_0 + \beta_1 \ln(P_i^{butter}) + u_i, \qquad (12.3)$$

where $Q_i^{butter}$ is the $i$th observation on the quantity of butter consumed, $P_i^{butter}$ is its
price, and $u_i$ represents other factors that affect demand, such as income and consumer
tastes. In Equation (12.3), a 1% increase in the price of butter yields a $\beta_1$
percent change in demand, so $\beta_1$ is the demand elasticity.

Philip Wright had data on total annual butter consumption and its average 
annual price in the United States for 1912 to 1922. It would have been easy to use 
these data to estimate the demand elasticity by applying OLS to Equation (12.3), but 
he had a key insight: Because of the interactions between supply and demand, the
regressor, $\ln(P_i^{butter})$, was likely to be correlated with the error term.

To see this, look at Figure 12.1a, which shows the market demand and supply
curves for butter for three different years. The demand and supply curves for the first
period are denoted $D_1$ and $S_1$, and the first period’s equilibrium price and quantity
are determined by their intersection. In year 2, demand increases from $D_1$ to $D_2$ (say,
because of an increase in income), and supply decreases from $S_1$ to $S_2$ (because of an
increase in the cost of producing butter); the equilibrium price and quantity are
determined by the intersection of the new supply and demand curves. In year 3, the
factors affecting demand and supply change again; demand increases again to $D_3$,
supply increases to $S_3$, and a new equilibrium quantity and price are determined.
Figure 12.1b shows the equilibrium quantity and price pairs for these three periods 
and for eight subsequent years, where in each year the supply and demand curves are 
subject to shifts associated with factors other than price that affect market supply and 
demand. This scatterplot is like the one that Wright would have seen when he plotted 
his data. As he reasoned, fitting a line to these points by OLS will estimate neither a 
demand curve nor a supply curve because the points have been determined by 
changes in both demand and supply. 

Wright realized that a way to get around this problem was to find some third 
variable that shifted supply but did not shift demand. Figure 12.1c shows what hap- 
pens when such a variable shifts the supply curve but demand remains stable. Now 
all of the equilibrium price and quantity pairs lie on a stable demand curve, and the 
slope of the demand curve is easily estimated. In the instrumental variable formula- 
tion of Wright’s problem, this third variable —the instrumental variable —is corre- 
lated with price (it shifts the supply curve, which leads to a change in price) but is 
uncorrelated with u (the demand curve remains stable). Wright considered several 
potential instrumental variables; one was the weather. For example, below-average 
rainfall in a dairy region could impair grazing and thus reduce butter production at 
a given price (it would shift the supply curve to the left and increase the equilibrium 
price), so dairy-region rainfall satisfies the condition for instrument relevance. But 
dairy-region rainfall should not have a direct influence on the demand for butter, so 
the correlation between dairy-region rainfall and $u_i$ would be 0; that is, dairy-region
rainfall satisfies the condition for instrument exogeneity. 
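
Wright's reasoning can be illustrated with a small simulated market. This is a hypothetical sketch, not Wright's data or method: the demand and supply parameters, the variable `rain`, and the sample size are all assumptions chosen for illustration. Rainfall shifts supply only, so it is relevant for price and unrelated to the demand error u. The IV slope is computed as the ratio of covariances that the text derives later as Equation (12.4).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

# Assumed structural parameters (hypothetical values).
beta0, beta1 = 5.0, -1.0                  # demand: ln Q = beta0 + beta1 ln P + u
gamma0, gamma1, gamma2 = 1.0, 0.5, 1.0    # supply: ln Q = gamma0 + gamma1 ln P + gamma2 rain + e

rain = rng.normal(size=n)                 # supply shifter (the instrument)
u = 0.5 * rng.normal(size=n)              # demand shocks (income, tastes, ...)
e = 0.5 * rng.normal(size=n)              # other supply shocks

# Equilibrium: set demand equal to supply and solve for ln P, then ln Q.
ln_p = (beta0 - gamma0 + u - gamma2 * rain - e) / (gamma1 - beta1)
ln_q = beta0 + beta1 * ln_p + u

# OLS slope of ln Q on ln P mixes demand and supply shifts.
ols_slope = np.cov(ln_p, ln_q)[0, 1] / np.var(ln_p, ddof=1)

# IV slope uses only the price variation induced by rainfall.
iv_slope = np.cov(rain, ln_q)[0, 1] / np.cov(rain, ln_p)[0, 1]

print(f"demand elasticity {beta1:.2f}, OLS {ols_slope:.3f}, IV {iv_slope:.3f}")
```

Because price responds positively to demand shocks in this setup, the OLS slope is biased toward zero, whereas the IV slope should recover the assumed demand elasticity.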


FIGURE 12.1 Equilibrium Price and Quantity Data

(a) Demand and supply in three time periods. Price and quantity are determined by the
intersection of the supply and demand curves. The equilibrium in the first period is
determined by the intersection of the demand curve $D_1$ and the supply curve $S_1$.
Equilibrium in the second period is the intersection of $D_2$ and $S_2$, and equilibrium in
the third period is the intersection of $D_3$ and $S_3$.

(b) Equilibrium price and quantity for 11 time periods. This scatterplot shows equilibrium
price and quantity in 11 different time periods. The demand and supply curves are hidden.
Can you determine the demand and supply curves from the points on the scatterplot?

(c) Equilibrium price and quantity when only the supply curve shifts. When the supply
curve shifts from $S_1$ to $S_2$ to $S_3$ but the demand curve remains at $D_1$, the equilibrium
prices and quantities trace out the demand curve.

Each panel plots price (vertical axis) against quantity (horizontal axis).


Example 2: Estimating the effect on test scores of class size. Despite controlling for
student and district characteristics, the estimates of the effect on test scores of class size
reported in Part II still might have omitted variable bias resulting from unmeasured
variables such as learning opportunities outside school or the quality of the teachers. If
data on these variables, or on suitable control variables, are unavailable, this omitted
variable bias cannot be addressed by including the variables in the multiple regressions.

Instrumental variables regression provides an alternative approach to this
problem. Consider the following hypothetical example: Some California schools
are forced to close for repairs because of a summer earthquake. Districts closest
to the epicenter are most severely affected. A district with some closed schools
needs to “double up” its students, temporarily increasing class size. This means
that distance from the epicenter satisfies the condition for instrument relevance
because it is correlated with class size. But if distance to the epicenter is unrelated
to any of the other factors affecting student performance (such as whether the students
are still learning English or disruptive effects of the earthquake on student
performance), then it will be exogenous because it is uncorrelated with the error
term. Thus the instrumental variable, distance to the epicenter, could be used to
circumvent omitted variable bias and to estimate the effect of class size on test
scores.


The Sampling Distribution of the TSLS Estimator 


The exact distribution of the TSLS estimator in small samples is complicated. 
However, like the OLS estimator, its distribution in large samples is simple: The 
TSLS estimator is consistent and is normally distributed. 


Formula for the TSLS estimator. Although the two stages of TSLS make the 
estimator seem complicated, when there is a single X and a single instrument Z, as 
we assume in this section, there is a simple formula for the TSLS estimator. Let $s_{ZY}$
be the sample covariance between Z and Y, and let $s_{ZX}$ be the sample covariance
between Z and X. As shown in Appendix 12.2, the TSLS estimator with a single
instrument is

$$\hat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}}. \qquad (12.4)$$

That is, the TSLS estimator of $\beta_1$ is the ratio of the sample covariance between Z and
Y to the sample covariance between Z and X.
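
A quick numerical check of this formula, on assumed simulated data (the data-generating values are hypothetical), is that the ratio of sample covariances coincides with the slope obtained by running the two stages explicitly:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
z = rng.normal(size=n)
v = rng.normal(size=n)
x = 1.0 + 0.5 * z + v
y = 2.0 - 1.0 * x + 0.8 * v + rng.normal(size=n)   # error correlated with v

# Ratio-of-covariances form of the TSLS estimator, Equation (12.4).
s_zy = np.cov(z, y)[0, 1]
s_zx = np.cov(z, x)[0, 1]
b1_ratio = s_zy / s_zx

# Explicit two stages; np.polyfit returns (slope, intercept) for degree 1.
pi1_hat, pi0_hat = np.polyfit(z, x, 1)
x_hat = pi0_hat + pi1_hat * z
b1_two_stage, _ = np.polyfit(x_hat, y, 1)

print(b1_ratio, b1_two_stage)   # the two numbers agree up to rounding error
```

The agreement is algebraic rather than a feature of this particular sample: the second-stage slope is the covariance of $\hat{X}$ and Y divided by the variance of $\hat{X}$, and substituting $\hat{X}_i = \hat{\pi}_0 + \hat{\pi}_1 Z_i$ reduces that ratio to $s_{ZY}/s_{ZX}$.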

Sampling distribution of $\hat{\beta}_1^{TSLS}$ when the sample size is large. The formula in
Equation (12.4) can be used to show that $\hat{\beta}_1^{TSLS}$ is consistent and, in large samples, normally
distributed. The argument is summarized here, with mathematical details given in
Appendix 12.3.

The argument that $\hat{\beta}_1^{TSLS}$ is consistent combines the assumptions that $Z_i$ is
relevant and exogenous with the consistency of sample covariances for population
covariances. To begin, note that because $Y_i = \beta_0 + \beta_1 X_i + u_i$ in Equation (12.1),

$$\mathrm{cov}(Z_i, Y_i) = \mathrm{cov}(Z_i, \beta_0 + \beta_1 X_i + u_i) = \beta_1 \mathrm{cov}(Z_i, X_i) + \mathrm{cov}(Z_i, u_i), \qquad (12.5)$$

where the second equality follows from the properties of covariances [Equation
(2.34)]. By the instrument exogeneity assumption, $\mathrm{cov}(Z_i, u_i) = 0$, and by the
instrument relevance assumption, $\mathrm{cov}(Z_i, X_i) \neq 0$. Thus, if the instrument is valid,
Equation (12.5) implies that

$$\beta_1 = \frac{\mathrm{cov}(Z_i, Y_i)}{\mathrm{cov}(Z_i, X_i)}. \qquad (12.6)$$

That is, the population coefficient $\beta_1$ is the ratio of the population covariance between
Z and Y to the population covariance between Z and X.

As discussed in Section 3.7, the sample covariance is a consistent estimator of the
population covariance; that is, $s_{ZY} \xrightarrow{p} \mathrm{cov}(Z_i, Y_i)$ and $s_{ZX} \xrightarrow{p} \mathrm{cov}(Z_i, X_i)$. It
follows from Equations (12.4) and (12.6) that the TSLS estimator is consistent:

$$\hat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} \xrightarrow{p} \frac{\mathrm{cov}(Z_i, Y_i)}{\mathrm{cov}(Z_i, X_i)} = \beta_1. \qquad (12.7)$$


The formula in Equation (12.4) also can be used to show that the sampling distribution
of $\hat{\beta}_1^{TSLS}$ is normal in large samples. The reason is the same as for every other least
squares estimator we have considered: The TSLS estimator is an average of random
variables, and when the sample size is large, the central limit theorem tells us that
averages of random variables are normally distributed. Specifically, the numerator of
the expression for $\hat{\beta}_1^{TSLS}$ in Equation (12.4) is
$s_{ZY} = \frac{1}{n-1}\sum_{i=1}^{n}(Z_i - \bar{Z})(Y_i - \bar{Y})$,
an average of $(Z_i - \bar{Z})(Y_i - \bar{Y})$. A bit of algebra, sketched out in Appendix 12.3,
shows that because of this averaging, the central limit theorem implies that, in large
samples, $\hat{\beta}_1^{TSLS}$ has a sampling distribution that is approximately
$N(\beta_1, \sigma^2_{\hat{\beta}_1^{TSLS}})$, where

$$\sigma^2_{\hat{\beta}_1^{TSLS}} = \frac{1}{n}\,\frac{\mathrm{var}[(Z_i - \mu_Z)u_i]}{[\mathrm{cov}(Z_i, X_i)]^2}. \qquad (12.8)$$


Statistical inference using the large-sample distribution. The variance $\sigma^2_{\hat{\beta}_1^{TSLS}}$ can be
estimated by estimating the variance and covariance terms appearing in Equation
(12.8), and the square root of the estimate of $\sigma^2_{\hat{\beta}_1^{TSLS}}$ is the standard error of the IV
estimator. This is done automatically in TSLS regression commands in econometric
software packages. Because $\hat{\beta}_1^{TSLS}$ is normally distributed in large samples, hypothesis
tests about $\beta_1$ can be performed by computing the t-statistic, and a 95% large-sample
confidence interval is given by $\hat{\beta}_1^{TSLS} \pm 1.96\,SE(\hat{\beta}_1^{TSLS})$.
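
As a rough sketch of what such software does in the single-instrument case, the terms in Equation (12.8) can be replaced by sample analogues. The simulated data and all numerical values below are assumptions, and packaged TSLS commands may apply degrees-of-freedom adjustments that this sketch omits.

```python
import numpy as np

# Hypothetical simulated data (assumed values, for illustration only).
rng = np.random.default_rng(3)
n = 5_000
beta0, beta1 = 2.0, -1.0
z = rng.normal(size=n)
v = rng.normal(size=n)
u = 0.8 * v + rng.normal(size=n)
x = 1.0 + 0.5 * z + v
y = beta0 + beta1 * x + u

# TSLS point estimates from the covariance-ratio formula, Equation (12.4).
s_zy = np.cov(z, y)[0, 1]
s_zx = np.cov(z, x)[0, 1]
b1 = s_zy / s_zx
b0 = y.mean() - b1 * x.mean()

# Residuals are formed with the actual X, not the first-stage predicted value.
u_hat = y - b0 - b1 * x

# Sample analogue of Equation (12.8): (1/n) var[(Z - mu_Z) u] / cov(Z, X)^2.
var_zu = np.var((z - z.mean()) * u_hat)
se_b1 = np.sqrt(var_zu / n) / abs(s_zx)

# t-statistic for the null hypothesis beta1 = 0 and a 95% confidence interval.
t_stat = b1 / se_b1
ci_low, ci_high = b1 - 1.96 * se_b1, b1 + 1.96 * se_b1

print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Because the variance term is estimated from the products $(Z_i - \bar{Z})\hat{u}_i$ directly, this standard error does not assume homoskedasticity.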


Application to the Demand for Cigarettes 


Philip Wright was interested in the demand elasticity of butter, but Wright’s thinking 
could be explored with a view to estimating other important quantities. One exam- 
ple is the spending elasticity for mortality, the percentage change in avoidable mor- 
tality resulting from a 1% increase in healthcare expenditure, where researchers 
have also used an IV estimator to overcome simultaneous equation bias to inform 
health policy debates. Other examples concern other commodities, besides butter, 
such as cigarettes, which today figure more prominently in public policy debates. 

The answer to this question depends on the elasticity of demand for cigarettes.
If the elasticity is —1, then the 20% target in consumption can be achieved by a 20% 
increase in price. If the elasticity is —0.5, then the price must rise 40% to decrease 
consumption by 20%. Of course, we do not know the demand elasticity of cigarettes: 
We must estimate it from data on prices and sales. But, as with butter, because of the 
interactions between supply and demand, the elasticity of demand for cigarettes 
cannot be estimated consistently by an OLS regression of log quantity on log price. 
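
The arithmetic behind these two cases is simply the target percentage change in quantity divided by the elasticity. The helper below is a hypothetical illustration (the function name is an assumption), and it relies on the same approximation the text uses, namely that the elasticity is constant over the range of the price change.

```python
def required_price_change(target_quantity_change, elasticity):
    """Approximate percentage price change needed for a target percentage change in quantity."""
    if elasticity == 0:
        raise ValueError("elasticity must be nonzero")
    return target_quantity_change / elasticity

# A 20% reduction in consumption under the two elasticities discussed in the text.
for elasticity in (-1.0, -0.5):
    change = required_price_change(-20.0, elasticity)
    print(f"elasticity {elasticity}: price must rise by about {change:.0f}%")
```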

We therefore use TSLS to estimate the elasticity of demand for cigarettes using 
annual data for the 48 contiguous U.S. states for 1985 through 1995 (the data are 
described in Appendix 12.1). For now, all the results are for the cross section of states 
in 1995; results using data for earlier years (panel data) are presented in Section 12.4. 

The instrumental variable, $SalesTax_i$, is the portion of the tax on cigarettes arising
from the general sales tax, measured in dollars per pack (in real dollars, deflated by
the Consumer Price Index). Cigarette consumption, $Q_i^{cigarettes}$, is the number of packs
of cigarettes sold per capita in the state, and the price, $P_i^{cigarettes}$, is the average real
price per pack of cigarettes including all taxes.

Before using TSLS, it is essential to ask whether the two conditions for instru- 
ment validity hold. We return to this topic in detail in Section 12.3, where we provide 
some statistical tools that help in this assessment. Even with those statistical tools, 
judgment plays an important role, so it is useful to think about whether the sales tax 
on cigarettes plausibly satisfies the two conditions. 

First consider instrument relevance. Because a high sales tax increases the after-tax
sales price $P_i^{cigarettes}$, the sales tax per pack plausibly satisfies the condition for
instrument relevance.

Next consider instrument exogeneity. For the sales tax to be exogenous, it must be 
uncorrelated with the error in the demand equation; that is, the sales tax must affect the 
demand for cigarettes only indirectly through the price. This seems plausible: General 
sales tax rates vary from state to state, but they do so mainly because different states 
choose different mixes of sales, income, property, and other taxes to finance public 
undertakings. Those choices about public finance are driven by political considerations, 
not by factors related to the demand for cigarettes. We discuss the credibility of this 
assumption more in Section 12.4, but for now we keep it as a working hypothesis. 

In modern statistical software, the first stage of TSLS is estimated automati- 
cally, so you do not need to run this regression yourself to compute the TSLS 
estimator. Even so, it is a good idea to look at the first-stage regression. Using data 
for the 48 states in 1995, it is 


$$\widehat{\ln(P_i^{cigarettes})} = \underset{(0.03)}{4.62} + \underset{(0.005)}{0.031}\,SalesTax_i. \qquad (12.9)$$


As expected, higher sales taxes mean higher after-tax prices. The $R^2$ of this regression
is 47%, so the variation in sales tax on cigarettes explains 47% of the variance of cigarette
prices across states.


In the second stage of TSLS, $\ln(Q_i^{cigarettes})$ is regressed on $\widehat{\ln(P_i^{cigarettes})}$ using
OLS. The resulting estimated regression function is

$$\widehat{\ln(Q_i^{cigarettes})} = 9.72 - 1.08\,\widehat{\ln(P_i^{cigarettes})}. \qquad (12.10)$$

This estimated regression function is written using the regressor in the second stage,
the predicted value $\widehat{\ln(P_i^{cigarettes})}$. It is, however, conventional and less cumbersome
simply to report the estimated regression function with $\ln(P_i^{cigarettes})$ rather than
$\widehat{\ln(P_i^{cigarettes})}$. Reported in this notation, the TSLS estimates and
heteroskedasticity-robust standard errors are

$$\widehat{\ln(Q_i^{cigarettes})} = \underset{(1.53)}{9.72} - \underset{(0.32)}{1.08}\,\ln(P_i^{cigarettes}). \qquad (12.11)$$

The TSLS estimate suggests that the demand for cigarettes is surprisingly elastic in 
light of their addictive nature: An increase in the price of 1% reduces consumption 
by 1.08%. But, recalling our discussion of instrument exogeneity, perhaps this esti- 
mate should not yet be taken too seriously. Even though the elasticity was estimated 
using an instrumental variable, there might still be omitted variables that are corre- 
lated with the sales tax per pack. A leading candidate is income: States with higher 
incomes might depend relatively less on a sales tax and more on an income tax to 
finance state government. Moreover, the demand for cigarettes presumably depends 
on income. Thus we would like to reestimate our demand equation including income 
as a control variable. To do so, however, we must first extend the IV regression model 
to include additional regressors. 


The General IV Regression Model 


The general IV regression model has four types of variables: the dependent vari- 
able, Y; problematic endogenous regressors, like the price of cigarettes, which are 
correlated with the error term and which we will label X; additional regressors W, 
which are either control variables or included exogenous variables; and instrumental 
variables, Z. In general, there can be multiple endogenous regressors (X’s), multiple 
additional regressors (W’s), and multiple instrumental variables (Z’s). 

For IV regression to be possible, there must be at least as many instrumental vari- 
ables (Z’s) as endogenous regressors (X’s). In Section 12.1, there was a single endogenous 
regressor and a single instrument. Having (at least) one instrument for this single endog- 
enous regressor was essential. Without the instrument, we could not have computed the 
instrumental variables estimator: there would be no first-stage regression in TSLS. 

The relationship between the number of instruments and the number of endogenous
regressors has its own terminology. The regression coefficients are said to be
exactly identified if the number of instruments (m) equals the number of endogenous
regressors (k); that is, $m = k$. The coefficients are overidentified if the number
of instruments exceeds the number of endogenous regressors; that is, $m > k$. They
are underidentified if the number of instruments is less than the number of endogenous
regressors; that is, $m < k$. The coefficients must be either exactly identified or
overidentified if they are to be estimated by IV regression.

The general IV regression model and its terminology are summarized in
Key Concept 12.1.


KEY CONCEPT 12.1: The General Instrumental Variables Regression Model and Terminology

The general IV regression model is

$$Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \beta_{k+1} W_{1i} + \cdots + \beta_{k+r} W_{ri} + u_i, \qquad (12.12)$$

$i = 1, \ldots, n$, where

• $Y_i$ is the dependent variable;

• $\beta_0, \beta_1, \ldots, \beta_{k+r}$ are unknown coefficients;

• $X_{1i}, \ldots, X_{ki}$ are k endogenous regressors, which are potentially correlated
with $u_i$;

• $W_{1i}, \ldots, W_{ri}$ are r included exogenous regressors, which are uncorrelated
with $u_i$ or are control variables;

• $u_i$ is the error term, which represents measurement error and/or omitted
factors; and

• $Z_{1i}, \ldots, Z_{mi}$ are m instrumental variables.

The coefficients are overidentified if there are more instruments than endogenous
regressors ($m > k$), they are underidentified if $m < k$, and they are exactly identified
if $m = k$. Estimation of the IV regression model requires exact identification
or overidentification.
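
The counting rule can be written as a tiny helper. The function name is hypothetical and the sketch only encodes the comparison of m and k stated above; it says nothing about whether the instruments are actually valid.

```python
def identification_status(m, k):
    """Classify the coefficients by comparing m instruments with k endogenous regressors."""
    if m > k:
        return "overidentified"
    if m == k:
        return "exactly identified"
    return "underidentified"

# One endogenous regressor with one instrument, as in Section 12.1.
print(identification_status(m=1, k=1))
# One endogenous regressor with two instruments.
print(identification_status(m=2, k=1))
# Two endogenous regressors with a single instrument: IV estimation is not possible.
print(identification_status(m=1, k=2))
```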


Included exogenous variables and control variables in IV regression. The W variables
in Equation (12.12) can be either exogenous variables, in which case
$E(u_i \mid W_i) = 0$, or they can be control variables that need not have a causal
interpretation but are included to ensure that the instrument is uncorrelated with the
error term. For example, Section 12.1 raised the possibility that the sales tax might
be correlated with income, which economic theory tells us is a determinant of
cigarette demand. If so, the sales tax would be correlated with the error term in the
cigarette demand equation, $\ln(Q_i^{cigarettes}) = \beta_0 + \beta_1 \ln(P_i^{cigarettes}) + u_i$, and thus
would not be an exogenous instrument. Including income in the IV regression, or
including variables that control for income, would remove this source of potential
correlation between the instrument and the error term. In general, if W is an effective
control variable in IV regression, then including W makes the instrument
uncorrelated with u, so the TSLS estimator of the coefficient on X is consistent; if
W is correlated with u, however, then the TSLS coefficient on W is subject to omitted
variable bias and does not have a causal interpretation. The logic of control
variables in IV regression therefore parallels the logic of control variables in OLS,
discussed in Section 7.5.

The mathematical condition for W to be an effective control variable in IV
regression is similar to the condition on control variables in OLS discussed in
Section 7.5. Specifically, including W must ensure that the conditional mean of u
does not depend on Z, so conditional mean independence holds; that is,
$E(u_i \mid Z_i, W_i) = E(u_i \mid W_i)$. For clarity, in the body of this chapter we focus on the case
that W variables are exogenous, so that $E(u_i \mid W_i) = 0$. Appendix 12.6 explains how
the results of this chapter extend to the case that W is a control variable, in which case
the conditional mean 0 condition, $E(u_i \mid W_i) = 0$, is replaced by the conditional mean
independence condition, $E(u_i \mid Z_i, W_i) = E(u_i \mid W_i)$.


TSLS in the General IV Model 


TSLS with a single endogenous regressor. When there is a single endogenous regressor
X and some additional included exogenous variables, the equation of interest is

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \cdots + \beta_{1+r} W_{ri} + u_i, \qquad (12.13)$$

where, as before, $X_i$ might be correlated with the error term, but $W_{1i}, \ldots, W_{ri}$ are not.

The population first-stage regression of TSLS relates X to the exogenous variables,
that is, the W’s and the instruments (Z’s):

$$X_i = \pi_0 + \pi_1 Z_{1i} + \cdots + \pi_m Z_{mi} + \pi_{m+1} W_{1i} + \cdots + \pi_{m+r} W_{ri} + v_i, \qquad (12.14)$$

where $\pi_0, \pi_1, \ldots, \pi_{m+r}$ are unknown regression coefficients and $v_i$ is an error term.

Equation (12.14) is sometimes called the reduced form equation for X. It relates
the endogenous variable X to all the available exogenous variables, both those
included in the regression of interest (W) and the instruments (Z).

In the first stage of TSLS, the unknown coefficients in Equation (12.14) are
estimated by OLS, and the predicted values from this regression are $\hat{X}_1, \ldots, \hat{X}_n$.

In the second stage of TSLS, Equation (12.13) is estimated by OLS except that
$X_i$ is replaced by its predicted value from the first stage. That is, $Y_i$ is regressed on
$\hat{X}_i, W_{1i}, \ldots, W_{ri}$ using OLS. The resulting estimator of $\beta_0, \beta_1, \ldots, \beta_{1+r}$ is the TSLS
estimator.


KEY CONCEPT 12.2: Two Stage Least Squares

The TSLS estimator in the general IV regression model in Equation (12.12) with
multiple instrumental variables is computed in two stages:

1. First-stage regression(s): Regress $X_{1i}$ on the instrumental variables
($Z_{1i}, \ldots, Z_{mi}$) and the included exogenous variables and/or control variables
($W_{1i}, \ldots, W_{ri}$) using OLS, including an intercept. Compute the predicted values
from this regression; call these $\hat{X}_{1i}$. Repeat this for all the endogenous
regressors $X_{2i}, \ldots, X_{ki}$, thereby computing the predicted values $\hat{X}_{1i}, \ldots, \hat{X}_{ki}$.

2. Second-stage regression: Regress $Y_i$ on the predicted values of the endogenous
variables ($\hat{X}_{1i}, \ldots, \hat{X}_{ki}$) and the included exogenous variables and/or control
variables ($W_{1i}, \ldots, W_{ri}$) using OLS, including an intercept. The TSLS estimators
$\hat{\beta}_0^{TSLS}, \ldots, \hat{\beta}_{k+r}^{TSLS}$ are the estimators from the second-stage regression.

In practice, the two stages are done automatically within TSLS estimation commands
in econometric software.
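
The two stages in Key Concept 12.2 can be sketched with matrix operations. The function `tsls` below is a hypothetical helper written for illustration, not the routine of any particular software package, and the simulated data use assumed coefficient values. The sketch returns point estimates only; it does not compute TSLS standard errors.

```python
import numpy as np

def tsls(y, X, W, Z):
    """Point estimates [intercept, X coefficients, W coefficients] from two stage least squares.

    y: (n,) dependent variable; X: (n, k) endogenous regressors;
    W: (n, r) included exogenous regressors; Z: (n, m) instruments.
    """
    n = len(y)
    const = np.ones((n, 1))
    # First stage: regress every column of X on the intercept, the Z's, and the W's.
    first_stage = np.hstack([const, Z, W])
    pi_hat, *_ = np.linalg.lstsq(first_stage, X, rcond=None)
    X_hat = first_stage @ pi_hat
    # Second stage: regress y on the intercept, the predicted X's, and the W's.
    second_stage = np.hstack([const, X_hat, W])
    beta_hat, *_ = np.linalg.lstsq(second_stage, y, rcond=None)
    return beta_hat

# Hypothetical example with k = 1, m = 2, r = 1 (overidentified).
rng = np.random.default_rng(6)
n = 20_000
W = rng.normal(size=(n, 1))
Z = rng.normal(size=(n, 2))
v = rng.normal(size=n)
u = 0.7 * v + rng.normal(size=n)                 # makes X endogenous
x = 0.5 + 0.6 * Z[:, 0] + 0.4 * Z[:, 1] + 0.3 * W[:, 0] + v
y = 1.0 + 2.0 * x - 1.5 * W[:, 0] + u

beta_hat = tsls(y, x.reshape(-1, 1), W, Z)
print(beta_hat)   # should be near the assumed values 1.0, 2.0, -1.5
```

Regressing all columns of X on the same first-stage design in one call is equivalent to running a separate first-stage regression for each endogenous regressor.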


Extension to multiple endogenous regressors. When there are multiple endogenous
regressors $X_{1i}, \ldots, X_{ki}$, the TSLS algorithm is similar except that each endogenous
regressor requires its own first-stage regression. Each of these first-stage regressions
has the same form as Equation (12.14); that is, the dependent variable is one of the
X’s, and the regressors are all the instruments (Z’s) and all the included exogenous
variables (W’s). Together, these first-stage regressions produce predicted values of
each of the endogenous regressors.

In the second stage of TSLS, Equation (12.12) is estimated by OLS except that
the endogenous regressors (X’s) are replaced by their respective predicted values
($\hat{X}$’s). The resulting estimator of $\beta_0, \beta_1, \ldots, \beta_{k+r}$ is the TSLS estimator.

In practice, the two stages of TSLS are done automatically within TSLS estima- 
tion commands in econometric software. The general TSLS estimator is summarized 
in Key Concept 12.2. 


Instrument Relevance and Exogeneity 
in the General IV Model 


The conditions of instrument relevance and exogeneity need to be modified for the 
general IV regression model. 

When there is one included endogenous variable but multiple instruments, the
condition for instrument relevance is that at least one Z is useful for predicting X
given W. When there are multiple included endogenous variables, this condition is
more complicated because we must rule out perfect multicollinearity in the second-stage
population regression. Intuitively, when there are multiple included
endogenous variables, the instruments must provide enough information about the
exogenous movements in these variables to sort out their separate effects on Y.

The general statement of the instrument exogeneity condition is that each instrument
must be uncorrelated with the error term $u_i$. The general conditions for valid
instruments are given in Key Concept 12.3.


KEY CONCEPT 12.3: The Two Conditions for Valid Instruments

A set of m instruments $Z_{1i}, \ldots, Z_{mi}$ must satisfy the following two conditions to
be valid:

1. Instrument Relevance

In general, let $\hat{X}^*_{1i}$ be the predicted value of $X_{1i}$ from the population regression
of $X_{1i}$ on the instruments (Z’s) and the included exogenous regressors
(W’s), and let “1” denote the constant regressor that takes on the value 1
for all observations. Then ($\hat{X}^*_{1i}, \ldots, \hat{X}^*_{ki}, W_{1i}, \ldots, W_{ri}, 1$) are not perfectly
multicollinear.

If there is only one X, then for the previous condition to hold, at least one
Z must have a nonzero coefficient in the population regression of X on the
Z’s and the W’s.

2. Instrument Exogeneity

The instruments are uncorrelated with the error term; that is,
$\mathrm{corr}(Z_{1i}, u_i) = 0, \ldots, \mathrm{corr}(Z_{mi}, u_i) = 0$.
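
The multicollinearity requirement in the relevance condition can be illustrated numerically. The sketch below is a sample analogue on assumed simulated data (the condition itself is stated for the population regression), it omits W variables for brevity, and the function name `second_stage_rank` is hypothetical. With two endogenous regressors and only one instrument, both predicted values are linear functions of the same Z, so the second-stage regressors are perfectly multicollinear.

```python
import numpy as np

def second_stage_rank(X, Z):
    """Rank of the second-stage regressor matrix [X_hat, 1] (no W's in this sketch)."""
    n = X.shape[0]
    const = np.ones((n, 1))
    first_stage = np.hstack([const, Z])
    pi_hat, *_ = np.linalg.lstsq(first_stage, X, rcond=None)
    X_hat = first_stage @ pi_hat
    second_stage = np.hstack([X_hat, const])
    # An explicit tolerance guards against rounding error in the predicted values.
    return np.linalg.matrix_rank(second_stage, tol=1e-8)

rng = np.random.default_rng(1)
n = 1_000
Z = rng.normal(size=(n, 2))
V = rng.normal(size=(n, 2))
# Two endogenous regressors that load differently on the two instruments.
X = np.column_stack([
    Z[:, 0] + 0.5 * Z[:, 1] + V[:, 0],
    0.3 * Z[:, 0] - Z[:, 1] + V[:, 1],
])

rank_two_instruments = second_stage_rank(X, Z)         # k = 2, m = 2
rank_one_instrument = second_stage_rank(X, Z[:, :1])   # k = 2, m = 1

print(rank_two_instruments, rank_one_instrument)
```

Full column rank (3 here) is needed for the second-stage OLS regression to have a unique solution; the one-instrument case falls short of it.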


The IV Regression Assumptions and Sampling 
Distribution of the TSLS Estimator 


Under the IV regression assumptions, the TSLS estimator is consistent and has a 
sampling distribution that, in large samples, is approximately normal. 


The IV regression assumptions. The IV regression assumptions are modifications of 
the least squares assumptions for causal inference in the multiple regression model 
in Key Concept 6.4. 

The first IV regression assumption modifies the conditional mean assumption in 
Key Concept 6.4 to apply only to the included exogenous variables. Just like the 
second least squares assumption for the multiple regression model, the second IV 
regression assumption is that the draws are i.i.d., as they are if the data are collected 
by simple random sampling. Similarly, the third IV assumption is that large outliers 
are unlikely. 


KEY CONCEPT 12.4: The IV Regression Assumptions

The variables and errors in the IV regression model in Key Concept 12.1 satisfy
the following:

1. $E(u_i \mid W_{1i}, \ldots, W_{ri}) = 0$;

2. $(X_{1i}, \ldots, X_{ki}, W_{1i}, \ldots, W_{ri}, Z_{1i}, \ldots, Z_{mi}, Y_i)$ are i.i.d. draws from their joint
distribution;

3. Large outliers are unlikely: The X’s, W’s, Z’s, and Y have nonzero finite fourth
moments; and

4. The two conditions for a valid instrument in Key Concept 12.3 hold.


The fourth IV regression assumption is that the two conditions for instrument 
validity in Key Concept 12.3 hold. The instrument relevance condition in Key Con- 
cept 12.3 subsumes the fourth least squares assumption in Key Concepts 6.4 and 6.6 
(no perfect multicollinearity) by assuming that the regressors in the second-stage 
regression are not perfectly multicollinear. The IV regression assumptions are sum- 
marized in Key Concept 12.4. 


Sampling distribution of the TSLS estimator. Under the IV regression assumptions, 
the TSLS estimator is consistent and normally distributed in large samples. This is 
shown in Section 12.1 (and Appendix 12.3) for the special case of a single endoge- 
nous regressor, a single instrument, and no included exogenous variables. Conceptu- 
ally, the reasoning in Section 12.1 carries over to the general case of multiple 
instruments and multiple included endogenous variables. The expressions in the gen- 
eral case are complicated, however, and are deferred to Chapter 19. 


Inference Using the TSLS Estimator 


Because the sampling distribution of the TSLS estimator is normal in large samples, 
the general procedures for statistical inference (hypothesis tests and confidence 
intervals) in regression models extend to TSLS regression. For example, 95% confidence
intervals are constructed as the TSLS estimator ± 1.96 standard errors. Similarly,
joint hypotheses about the population values of the coefficients can be tested
using the F-statistic, as described in Section 7.2. 


Calculation of TSLS standard errors. There are two points to bear in mind about 
TSLS standard errors. First, the standard errors reported by OLS estimation of the 
second-stage regression are incorrect because they do not recognize that it is the 
second stage of a two-stage process. Specifically, the second-stage OLS standard 
errors fail to adjust for the second-stage regression using the predicted values of the 
included endogenous variables. Formulas for standard errors that make the necessary
adjustment are incorporated into (and automatically used by) TSLS regression
commands in econometric software. Therefore, this issue is not a concern in practice 
if you use a specialized TSLS regression command. 

Second, as always the error u might be heteroskedastic. It is therefore important 
to use heteroskedasticity-robust versions of the standard errors for precisely the 
same reason that it is important to use heteroskedasticity-robust standard errors for 
the OLS estimators of the multiple regression model. 
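
The two points about standard errors can be illustrated with a small simulation. The following is a minimal sketch, not the procedure used by any particular software package: the function names (`ols_coef`, `tsls_single_endog`), the simulated data, and the true coefficient of 2 are all assumptions made for illustration. It computes TSLS by hand and compares the standard error that OLS would report for the second stage (residuals built from the predicted values) with the adjusted versions (residuals built from the actual endogenous regressor), including a heteroskedasticity-robust version without a small-sample correction.

```python
import numpy as np

def ols_coef(A, y):
    """OLS coefficients from regressing y on the columns of A."""
    return np.linalg.lstsq(A, y, rcond=None)[0]

def tsls_single_endog(y, x, W, Z):
    """TSLS with one endogenous regressor x.

    W holds the included exogenous variables (including the constant column);
    Z holds the instruments. Returns the coefficients (x first, then W) and
    three sets of standard errors.
    """
    n = len(y)
    exog = np.column_stack([Z, W])
    x_hat = exog @ ols_coef(exog, x)        # first-stage predicted values
    R_hat = np.column_stack([x_hat, W])     # second-stage regressors
    R = np.column_stack([x, W])             # same regressors with the actual x
    beta = ols_coef(R_hat, y)
    dof = n - R.shape[1]
    bread = np.linalg.inv(R_hat.T @ R_hat)

    # What OLS on the second stage would report: residuals built from x_hat.
    e_naive = y - R_hat @ beta
    se_naive = np.sqrt(np.diag(bread) * (e_naive @ e_naive) / dof)

    # Adjusted, homoskedasticity-only: residuals built from the actual x.
    u_hat = y - R @ beta
    se_homosk = np.sqrt(np.diag(bread) * (u_hat @ u_hat) / dof)

    # Adjusted, heteroskedasticity-robust (sandwich form, no small-sample correction).
    meat = (R_hat * (u_hat ** 2)[:, None]).T @ R_hat
    se_robust = np.sqrt(np.diag(bread @ meat @ bread))
    return beta, se_naive, se_homosk, se_robust

# Simulated data (hypothetical): the true coefficient on x is 2.
rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal(n)
v = rng.standard_normal(n)
u = 0.8 * v + rng.standard_normal(n)        # u is correlated with x through v
x = z + v
y = 1.0 + 2.0 * x + u
W = np.ones((n, 1))                         # constant only
beta, se_naive, se_homosk, se_robust = tsls_single_endog(y, x, W, z[:, None])
print(beta[0], se_naive[0], se_homosk[0], se_robust[0])
```

In this design the second-stage OLS standard error is too large, because its residuals absorb the first-stage prediction error; in other designs it can err in the other direction. In practice a dedicated TSLS command handles the adjustment automatically.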


Application to the Demand for Cigarettes 


In Section 12.1, we estimated the elasticity of demand for cigarettes using data on 
annual consumption in 48 U.S. states in 1995 using TSLS with a single regressor (the 
logarithm of the real price per pack) and a single instrument (the real sales tax per 
pack). Income also affects demand, however, so it is part of the error term of the 
population regression. As discussed in Section 12.1, if the state sales tax is related to 
state income, it is correlated with a variable in the error term of the cigarette demand 
equation, which violates the instrument exogeneity condition. If so, the IV estimator 
in Section 12.1 is inconsistent. That is, the IV regression suffers from a version of 
omitted variable bias. We can solve this problem by including income in the 
regression. 

We therefore consider an alternative specification in which the logarithm of
income is included in the demand equation. In the terminology of Key Concept 12.1,
the dependent variable Y is the logarithm of consumption, $\ln(Q_i^{cigarettes})$; the
endogenous regressor X is the logarithm of the real after-tax price, $\ln(P_i^{cigarettes})$; the
included exogenous variable W is the logarithm of the real per capita state income,
$\ln(Inc_i)$; and the instrument Z is the real sales tax per pack, $SalesTax_i$. The TSLS
estimates and (heteroskedasticity-robust) standard errors are


$$\widehat{\ln(Q_i^{cigarettes})} = \underset{(1.26)}{9.43} - \underset{(0.37)}{1.14}\,\ln(P_i^{cigarettes}) + \underset{(0.31)}{0.21}\,\ln(Inc_i). \tag{12.15}$$


This regression uses a single instrument, $SalesTax_i$, but, in fact, another candidate
instrument is available. In addition to general sales taxes, states levy special taxes that
apply only to cigarettes and other tobacco products. These cigarette-specific taxes
($CigTax_i$) constitute a possible second instrumental variable. The cigarette-specific
tax increases the price of cigarettes paid by the consumer, so it arguably meets the 
condition for instrument relevance. If it is uncorrelated with the error term in the 
state cigarette demand equation, it is an exogenous instrument. 

With this additional instrument in hand, we now have two instrumental variables,
the real sales tax per pack and the real state cigarette-specific tax per pack. With two
instruments and a single endogenous regressor, the demand elasticity is overidentified;
that is, the number of instruments ($SalesTax_i$ and $CigTax_i$, so $m = 2$) exceeds
the number of included endogenous variables ($P_i^{cigarettes}$, so $k = 1$). We can estimate
the demand elasticity using TSLS, where the regressors in the first-stage regression
are the included exogenous variable, $\ln(Inc_i)$, and both instruments.

The resulting TSLS estimate of the regression function using the two instruments
$SalesTax_i$ and $CigTax_i$ is

$$\widehat{\ln(Q_i^{cigarettes})} = \underset{(0.96)}{9.89} - \underset{(0.25)}{1.28}\,\ln(P_i^{cigarettes}) + \underset{(0.25)}{0.28}\,\ln(Inc_i). \tag{12.16}$$


Compare Equations (12.15) and (12.16): The standard error of the estimated price 
elasticity is smaller by one-third in Equation (12.16) [0.25 in Equation (12.16) versus 
0.37 in Equation (12.15)]. The reason the standard error is smaller in Equation (12.16) 
is that this estimate uses more information than Equation (12.15): In Equation 
(12.15), only one instrument (the sales tax) is used, but in Equation (12.16), two 
instruments (the sales tax and the cigarette-specific tax) are used. Using two instru- 
ments explains more of the variation in cigarette prices than using just one, and this 
is reflected in smaller standard errors on the estimated demand elasticity. 
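
The comparison can be checked with a few lines of arithmetic. The sketch below uses only the estimates and standard errors reported in Equations (12.15) and (12.16); the helper name `ci95` and the variable names are assumptions made for illustration. It forms the 95% confidence intervals for the price elasticity as the estimate ± 1.96 standard errors and computes the proportional reduction in the standard error.

```python
def ci95(estimate, std_error):
    """95% confidence interval: estimate +/- 1.96 standard errors."""
    half_width = 1.96 * std_error
    return (estimate - half_width, estimate + half_width)

# Price elasticity estimates and standard errors as reported in the text.
ci_one_instrument = ci95(-1.14, 0.37)    # Equation (12.15): about (-1.87, -0.41)
ci_two_instruments = ci95(-1.28, 0.25)   # Equation (12.16): about (-1.77, -0.79)

# Proportional reduction in the standard error: about 0.32, roughly one-third.
se_reduction = (0.37 - 0.25) / 0.37

print(ci_one_instrument, ci_two_instruments, round(se_reduction, 3))
```

The interval based on two instruments is narrower, which is the practical payoff of the smaller standard error, provided both instruments are valid.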

Are these estimates credible? Ultimately, credibility depends on whether the set 
of instrumental variables—here, the two taxes—plausibly satisfies the two conditions 
for valid instruments. It is therefore vital that we assess whether these instruments 
are valid, and it is to this topic that we now turn. 


12.3 Checking Instrument Validity


Whether instrumental variables regression is useful in a given application hinges on 
whether the instruments are valid: Invalid instruments produce meaningless results. 
It therefore is essential to assess whether a given set of instruments is valid in a par- 
ticular application. 


Assumption 1: Instrument Relevance 


The role of the instrument relevance condition in IV regression is subtle. One way to 
think of instrument relevance is that it plays a role akin to the sample size: The more 
relevant are the instruments—that is, the more the variation in X is explained by 
the instruments—the more information is available for use in IV regression. A more 
relevant instrument produces a more accurate estimator, just as a larger sample size 
produces a more accurate estimator. Moreover, statistical inference using TSLS is 
predicated on the TSLS estimator having a normal sampling distribution, but accord- 
ing to the central limit theorem, the normal distribution is a good approximation in 
large—but not necessarily small—samples. If having a more relevant instrument is 
like having a larger sample size, this suggests, correctly, that the more relevant is the 
instrument, the better is the normal approximation to the sampling distribution of 
the TSLS estimator and its t-statistic. 

Instruments that explain little of the variation in X are called weak instruments.
In the cigarette example, the distance of the state from cigarette manufacturing 
plants arguably would be a weak instrument: Although a greater distance increases 
shipping costs (thus shifting the supply curve in and raising the equilibrium price), 
cigarettes are lightweight, so shipping costs are a small component of the price of 
cigarettes. Thus the amount of price variation explained by shipping costs, and thus 
distance to manufacturing plants, probably is quite small. 

This section discusses why weak instruments are a problem, how to check for 
weak instruments, and what to do if you have weak instruments. It is assumed 
throughout that the instruments are exogenous. 


Why weak instruments are a problem. If the instruments are weak, then the nor- 
mal distribution provides a poor approximation to the sampling distribution of the 
TSLS estimator, even if the sample size is large. Thus there is no theoretical justifica- 
tion for the usual methods for performing statistical inference, even in large samples. 
In fact, if instruments are weak, then the TSLS estimator can be badly biased in the 
direction of the OLS estimator. In addition, 95% confidence intervals constructed 
as the TSLS estimator ± 1.96 standard errors can contain the true value of the coefficient
far less than 95% of the time. In short, if instruments are weak, TSLS is no
longer reliable. 

To see that there is a problem with the large-sample normal approximation to
the sampling distribution of the TSLS estimator, consider the special case, introduced
in Section 12.1, of a single included endogenous variable, a single instrument,
and no included exogenous regressor. If the instrument is valid, then $\hat{\beta}_1^{TSLS}$
is consistent because the sample covariances $s_{ZY}$ and $s_{ZX}$ are consistent; that is,
$\hat{\beta}_1^{TSLS} = s_{ZY}/s_{ZX} \xrightarrow{p} \mathrm{cov}(Z_i, Y_i)/\mathrm{cov}(Z_i, X_i) = \beta_1$ [Equation (12.7)]. But now
suppose that the instrument is not just weak but in fact is irrelevant, so that
$\mathrm{cov}(Z_i, X_i) = 0$. Then $s_{ZX} \xrightarrow{p} \mathrm{cov}(Z_i, X_i) = 0$, so, taken literally, the denominator
on the right-hand side of the limit $\mathrm{cov}(Z_i, Y_i)/\mathrm{cov}(Z_i, X_i)$ is 0! Clearly, the
argument that $\hat{\beta}_1^{TSLS}$ is consistent breaks down when the instrument relevance
condition fails. As shown in Appendix 12.4, this breakdown results in the TSLS
estimator having a nonnormal sampling distribution, even if the sample size is
very large. In fact, when the instrument is irrelevant, the large-sample distribution
of $\hat{\beta}_1^{TSLS}$ is not the distribution of a normal random variable but rather the distribution
of a ratio of two normal random variables! As discussed in Appendix 12.4, this
ratio-of-normals distribution is centered at the large-sample value of the OLS
estimator.
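
A small Monte Carlo experiment makes the ratio-of-normals behavior concrete. The sketch below is illustrative only: the data-generating process, the sample size, and the number of replications are assumptions, chosen so that the instrument is completely irrelevant and the large-sample value of the OLS estimator is 1.5 while the true coefficient is 1.0. Because a ratio of normals has very heavy tails, the center of the IV estimates is summarized by the median rather than the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, beta1 = 1000, 2000, 1.0

iv_estimates = np.empty(reps)
ols_estimates = np.empty(reps)
for r in range(reps):
    z = rng.standard_normal(n)             # instrument, unrelated to x
    v = rng.standard_normal(n)
    u = 0.5 * v + rng.standard_normal(n)   # error term, correlated with x
    x = 0.0 * z + v                        # first-stage coefficient on z is 0
    y = beta1 * x + u

    zc = z - z.mean()
    xc = x - x.mean()
    iv_estimates[r] = (zc @ y) / (zc @ x)  # s_ZY / s_ZX
    ols_estimates[r] = (xc @ y) / (xc @ x) # s_XY / s_XX

# Large-sample OLS value in this design: beta1 + cov(x, u) / var(x) = 1.5.
iv_median = np.median(iv_estimates)
iv_iqr = np.percentile(iv_estimates, 75) - np.percentile(iv_estimates, 25)
ols_mean = ols_estimates.mean()
print(iv_median, iv_iqr, ols_mean)
```

Even with 1000 observations per sample, the IV estimates are centered near the OLS value rather than the true coefficient, and their spread does not shrink the way a normal approximation would suggest.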

While this circumstance of totally irrelevant instruments might not be encoun- 
tered in practice, it raises a question: How relevant must the instruments be for the 
normal distribution to provide a good approximation in practice? The answer to this 
question in the general IV model is complicated. Fortunately, however, there is a 
simple rule of thumb available for the most common situation in practice, the case of 
a single endogenous regressor. 

A Rule of Thumb for Checking for Weak Instruments
Key Concept 12.5

The first-stage F-statistic is the F-statistic testing the hypothesis that the coefficients
on the instruments $Z_{1i}, \dots, Z_{mi}$ equal 0 in the first stage of two stage least
squares. When there is a single endogenous regressor, a first-stage F-statistic less
than 10 indicates that the instruments are weak, in which case the TSLS estimator
is biased (even in large samples) and TSLS t-statistics and confidence intervals
are unreliable.


Checking for weak instruments when there is a single endogenous regressor. One 
way to check for weak instruments when there is a single endogenous regressor is to 
compute the F-statistic testing the hypothesis that the coefficients on the instruments 
are all 0 in the first-stage regression of TSLS. This first-stage F-statistic provides a 
measure of the information content contained in the instruments: The more informa- 
tion content, the larger the expected value of the F-statistic. One simple rule of 
thumb is that you do not need to worry about weak instruments if the first-stage 
F-statistic exceeds 10. (Why 10? See Appendix 12.5.) This is summarized in 
Key Concept 12.5. 
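
The first-stage F-statistic can be computed directly from restricted and unrestricted sums of squared residuals. The sketch below is a simplified illustration: it uses the homoskedasticity-only form of the F-statistic (a heteroskedasticity-robust version may be preferable in applied work), and the function name `first_stage_f` and the simulated data are assumptions, not part of the text.

```python
import numpy as np

def first_stage_f(x, W, Z):
    """Homoskedasticity-only F-statistic for H0: all coefficients on Z are 0
    in the first-stage regression of x on W (which includes the constant) and Z."""
    n = len(x)
    m = Z.shape[1]

    def ssr(A):
        coef = np.linalg.lstsq(A, x, rcond=None)[0]
        resid = x - A @ coef
        return resid @ resid

    unrestricted = np.column_stack([W, Z])
    ssr_u = ssr(unrestricted)   # x regressed on W and Z
    ssr_r = ssr(W)              # x regressed on W only (instruments excluded)
    dof = n - unrestricted.shape[1]
    return ((ssr_r - ssr_u) / m) / (ssr_u / dof)

# Hypothetical data: the same two instruments, strongly or barely related to x.
rng = np.random.default_rng(0)
n = 500
W = np.ones((n, 1))
Z = rng.standard_normal((n, 2))
v = rng.standard_normal(n)
x_strong = Z @ np.array([1.0, 0.5]) + v
x_weak = Z @ np.array([0.02, 0.01]) + v

f_strong = first_stage_f(x_strong, W, Z)
f_weak = first_stage_f(x_weak, W, Z)
print(f_strong, f_weak)
```

Comparing each value with the cutoff of 10 applies the rule of thumb.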


What do I do if I have weak instruments? If you have many instruments, some of
those instruments are probably weaker than others. If you have a small number of 
strong instruments and many weak ones, you will be better off discarding the weakest 
instruments and using the most relevant subset for your TSLS analysis. Your TSLS 
standard errors might increase when you drop weak instruments, but keep in mind 
that your original standard errors were not meaningful anyway! 

If, however, the coefficients are exactly identified, you cannot discard the weak 
instruments. Even if the coefficients are overidentified, you might not have enough 
strong instruments to achieve identification, so discarding some weak instruments will 
not help. In this case, you have two options. The first option is to find additional, stronger 
instruments. This is easier said than done: It requires an intimate knowledge of the prob- 
lem at hand and can entail redesigning the data set and the nature of the empirical study. 
The second option is to proceed with your empirical analysis using the weak instru- 
ments, but employing methods other than TSLS. Although this chapter has focused on 
TSLS, some other methods for instrumental variable analysis are less sensitive to weak 
instruments than TSLS, and some of these methods are discussed in Appendix 12.5. 


Assumption 2: Instrument Exogeneity 


If the instruments are not exogenous, then TSLS is inconsistent: The TSLS estimator 
converges in probability to something other than the causal coefficient. After all, the 
idea of instrumental variables regression is that the instrument contains information 

The First IV Regression


After he and his son Sewall derived the IV estimator
(see the box “Who Invented Instrumental
Variables Regression?”), Philip Wright set out to
see how it worked in practice. In a letter to Sewall of 
March 15, 1926, Philip wrote out a table (reproduced 
here in part) of annual data on variables relating to 
U.S. production of flaxseed from 1903 through 1925. 
Flaxseed was grown for its oil, also called linseed oil, 
which was used in oil-based paint for buildings. Philip 
wanted to estimate the elasticity of supply. To get a per- 
cent-percent relationship, he first transformed the data 
to be percentage deviations from a long-term trend. 
Philip then needed to make a key decision: What 
instrument should he use? He chose building per- 
mits on the East Coast. He reasoned that if there 
were more new buildings, there would be more 
demand for oil-based paint and thus for flaxseed, so 
the instrument would be relevant. He further rea- 
soned that fluctuations in building permits on the 
East Coast were largely driven by broader economic 
conditions that had nothing to do with disturbances 
to flaxseed supply in a given year, so that building 
permits would be exogenous. Said differently, fluctuations
in building permits on the East Coast were
a determinant of demand but not of supply.


After laborious computations (by hand, of course),
Philip obtained the IV estimate of the supply elasticity,
-0.88. This elasticity has the wrong sign: It suggests that
the supply curve slopes down. In the March 15 letter,
Philip called this result “obviously absurd.”

So what went wrong? Although Philip did not 
know it, his IV regression had a first-stage F-statistic 
of 1.75, far less than the rule-of-thumb cutoff of 10. 
As explained in the text and in Appendix 12.4, when
the instrument is irrelevant, the distribution of the IV
estimator centers on the OLS estimate, which in Wright’s data is -0.66.
This first IV regression had a very weak instrument,
and the result was biased toward OLS.

But Philip persevered. For estimating the demand 
elasticity, he had as an instrument rainfall in the 
Upper Midwest, where flaxseed was grown. More 
rain makes for a better harvest, so rainfall is plau- 
sibly relevant; because rainfall in the Midwest does 
not affect the demand for oil paint, it is plausibly 
exogenous. Rainfall, it turns out, has a first-stage F of
12.8 and yields an IV estimate of the demand elasticity
of -0.48. This estimate indicates that the demand
curve slopes down (as it should) and that demand is
inelastic, which is consistent with there being no good
substitute for linseed oil for paints during this period.


The First Five Observations of the First IV Regression Data Set, from Philip Wright's Letter to Sewall
Wright of March 15, 1926.

The first two data columns are the real price and quantity (“output”) of flaxseed. The “B” variables (acreage
planted, yield, rainfall in the Upper Midwest, and the ratio of flaxseed yield that year to spring wheat yield the
previous year) shift supply but not demand, so they are potential instruments for the demand elasticity. The “A”
variable (building permits on the East Coast) shifts demand but not supply, so it is a potential instrument for the
supply elasticity.

about variation in $X_i$ that is unrelated to the error term $u_i$. If, in fact, the instrument
is not exogenous, it cannot pinpoint this exogenous variation in $X_i$, and it stands to
reason that IV regression fails to provide a consistent estimator. The math behind
this argument is summarized in Appendix 12.4.


Can you statistically test the assumption that the instruments are exogenous? Yes 
and no. On the one hand, it is not possible to test the hypothesis that the instruments 
are exogenous when the coefficients are exactly identified. On the other hand, if the 
coefficients are overidentified, it is possible to test the overidentifying restrictions — 
that is, to test the hypothesis that the “extra” instruments are exogenous under the 
maintained assumption that there are enough valid instruments to identify the coef- 
ficients of interest. 

First consider the case that the coefficients are exactly identified, so you have as 
many instruments as endogenous regressors. Then it is impossible to develop a sta- 
tistical test of the hypothesis that the instruments are, in fact, exogenous. That is, 
empirical evidence cannot be brought to bear on the question of whether these 
instruments satisfy the exogeneity restriction. In this case, the only way to assess 
whether the instruments are exogenous is to draw on expert opinion and your per- 
sonal knowledge of the empirical problem at hand. For example, Philip Wright’s 
knowledge of agricultural supply and demand led him to suggest that below-average 
rainfall would plausibly shift the supply curve for fats and oils but would not directly 
shift the demand curve. 

Assessing whether the instruments are exogenous necessarily requires making 
an expert judgment based on personal knowledge of the application. If, however, 
there are more instruments than endogenous regressors, then there is a statistical 
tool that can be helpful in this process: the so-called test of overidentifying 
restrictions. 


The overidentifying restrictions test. Suppose you have a single endogenous regres- 
sor and two instruments. Then you could compute two different TSLS estimators: one 
using the first instrument and the other using the second. These two estimators will 
not be the same because of sampling variation, but if both instruments are exoge- 
nous, then they will tend to be close to each other. But what if these two instruments 
produce very different estimates? You might sensibly conclude that there is some- 
thing wrong with one or the other of the instruments or with both. That is, it would 
be reasonable to conclude that one or the other or both of the instruments are not 
exogenous. 

The test of overidentifying restrictions implicitly makes this comparison. We say
implicitly because the test is carried out without actually computing all of the different
possible IV estimates. Here is the idea. Exogeneity of the instruments means that they
are uncorrelated with $u_i$. This suggests that the instruments should be approximately
uncorrelated with $\hat{u}_i^{TSLS}$, where $\hat{u}_i^{TSLS} = Y_i - (\hat{\beta}_0^{TSLS} + \hat{\beta}_1^{TSLS} X_{1i} + \dots + \hat{\beta}_{k+r}^{TSLS} W_{ri})$
is the residual from the estimated TSLS regression using all the instruments (approximately
rather than exactly because of sampling variation). (Note that these residuals
are constructed using the true X’s rather than their first-stage predicted values.)
Accordingly, if the instruments are, in fact, exogenous, then the coefficients on the
instruments in a regression of $\hat{u}_i^{TSLS}$ on the instruments and the included exogenous
variables should all be 0, and this hypothesis can be tested.


The Overidentifying Restrictions Test (The J-Statistic)
Key Concept 12.6

Let $\hat{u}_i^{TSLS}$ be the residuals from TSLS estimation of Equation (12.12). Use OLS to
estimate the regression coefficients in

$$\hat{u}_i^{TSLS} = \delta_0 + \delta_1 Z_{1i} + \dots + \delta_m Z_{mi} + \delta_{m+1} W_{1i} + \dots + \delta_{m+r} W_{ri} + e_i, \tag{12.17}$$

where $e_i$ is the regression error term. Let F denote the homoskedasticity-only
F-statistic testing the hypothesis that $\delta_1 = \dots = \delta_m = 0$. The overidentifying
restrictions test statistic is $J = mF$. Under the null hypothesis that all the instruments
are exogenous, if $e_i$ is homoskedastic, in large samples J is distributed $\chi^2_{m-k}$,
where $m - k$ is the degree of overidentification, that is, the number of instruments
minus the number of endogenous regressors.

This method for computing the overidentifying restrictions test is summarized in
Key Concept 12.6. This statistic is computed using the homoskedasticity-only F-statistic.
The test statistic is commonly called the J-statistic and is computed as $J = mF$.

In large samples, if the instruments are not weak and the errors are homoskedastic,
then, under the null hypothesis that the instruments are exogenous, the J-statistic
has a chi-squared distribution with $m - k$ degrees of freedom ($\chi^2_{m-k}$). It is important
to remember that even though the number of restrictions being tested is $m$, the
degrees of freedom of the asymptotic distribution of the J-statistic is $m - k$. The
reason is that it is possible to test only the overidentifying restrictions, of which there
are $m - k$. The modification of the J-statistic for heteroskedastic errors is given in
Section 19.7.

The easiest way to see that you cannot test the exogeneity of the instruments when
the coefficients are exactly identified (m = k) is to consider the case of a single 
included endogenous variable (k = 1). If there are two instruments, then you can 
compute two TSLS estimators, one for each instrument, and you can compare them 
to see if they are close. But if you have only one instrument, then you can compute 
only one TSLS estimator, and you have nothing to which to compare it. In fact, if the 
coefficients are exactly identified, so that m = k, then the overidentifying test statis- 
tic J is exactly 0. 
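
The J-statistic recipe can be sketched in code. The following is an illustrative implementation of the homoskedasticity-only version described in Key Concept 12.6, under stated assumptions: the function name `j_statistic`, the simulated data, and the particular way the second instrument is made invalid are all hypothetical. The last call shows that J is numerically 0 when the coefficients are exactly identified.

```python
import numpy as np

def j_statistic(y, X, W, Z):
    """Return (J, degrees of freedom) for the overidentifying restrictions test.

    X: (n, k) endogenous regressors; W: (n, r) included exogenous variables,
    including the constant; Z: (n, m) instruments.
    """
    n = len(y)
    m, k = Z.shape[1], X.shape[1]
    exog = np.column_stack([Z, W])

    # TSLS: first-stage fitted values, then the second-stage coefficients.
    X_hat = exog @ np.linalg.lstsq(exog, X, rcond=None)[0]
    beta = np.linalg.lstsq(np.column_stack([X_hat, W]), y, rcond=None)[0]

    # TSLS residuals use the actual X, not the first-stage predicted values.
    u_hat = y - np.column_stack([X, W]) @ beta

    def ssr(A):
        coef = np.linalg.lstsq(A, u_hat, rcond=None)[0]
        resid = u_hat - A @ coef
        return resid @ resid

    # Homoskedasticity-only F for H0: coefficients on Z are all 0 (Equation 12.17).
    ssr_u, ssr_r = ssr(exog), ssr(W)
    F = ((ssr_r - ssr_u) / m) / (ssr_u / (n - exog.shape[1]))
    return m * F, m - k

# Hypothetical data: one endogenous regressor and two instruments.
rng = np.random.default_rng(1)
n = 2000
z1, z2, v, e = (rng.standard_normal(n) for _ in range(4))
W = np.ones((n, 1))
x = z1 + z2 + v
u_valid = 0.5 * v + e              # both instruments exogenous
u_invalid = u_valid + 0.8 * z2     # z2 is correlated with the error
Z = np.column_stack([z1, z2])
X = x[:, None]

J_valid, df = j_statistic(1.0 + x + u_valid, X, W, Z)
J_invalid, _ = j_statistic(1.0 + x + u_invalid, X, W, Z)
J_exact, df_exact = j_statistic(1.0 + x + u_valid, X, W, z1[:, None])
print(J_valid, J_invalid, J_exact, df, df_exact)
```

With one overidentifying restriction, J is compared with the 5% critical value of the chi-squared distribution with 1 degree of freedom, 3.84.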


12.4 Application to the Demand for Cigarettes¹


Our attempt to estimate the elasticity of demand for cigarettes left off with the TSLS 
estimates summarized in Equation (12.16), in which income was an included exoge- 
nous variable and there were two instruments, the general sales tax and the cigarette- 
specific tax. We can now undertake a more thorough evaluation of these instruments. 

As in Section 12.1, it makes sense that the two instruments are relevant because 
taxes are a big part of the after-tax price of cigarettes, and shortly we will look at this 
empirically. First, however, we focus on the difficult question of whether the two tax 
variables are plausibly exogenous. 

The first step in assessing whether an instrument is exogenous is to think through 
the arguments for why it may or may not be. This requires thinking about which fac- 
tors account for the error term in the cigarette demand equation and whether these 
factors are plausibly related to the instruments. 

Why do some states have higher per capita cigarette consumption than others? 
One reason might be variation in incomes across states, but state income is included in 
Equation (12.16), so this is not part of the error term. Another reason is that there are 
historical factors influencing demand. For example, states that grow tobacco have 
higher rates of smoking than most other states. Could this factor be related to taxes? 
Quite possibly: If tobacco farming and cigarette production are important industries in 
a state, then these industries could exert influence to keep cigarette-specific taxes low. 
This suggests that an omitted factor in cigarette demand (whether the state grows
tobacco and produces cigarettes) could be correlated with cigarette-specific taxes.

One solution to this possible correlation between the error term and the instru- 
ment would be to include information on the size of the tobacco and cigarette indus- 
try in the state; this is the approach we took when we included income as a regressor 
in the demand equation. But because we have panel data on cigarette consumption, 
a different approach is available that does not require this information. As discussed 
in Chapter 10, panel data make it possible to eliminate the influence of variables that 
vary across entities (states) but do not change over time, such as the historical cir- 
cumstances that lead to a large tobacco and cigarette industry in a state. Two methods 
for doing this were given in Chapter 10: constructing data on changes in the variables 
between two different time periods and using fixed effects regression. To keep the 
analysis here as simple as possible, we adopt the former approach and perform 
regressions of the type described in Section 10.2, based on the changes in the vari- 
ables between two different years. 
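
The differencing step can be written out as a short derivation. This is a sketch in the spirit of Section 10.2 rather than a formula taken from this section: the state effect $\alpha_i$ and the year-specific intercepts $\beta_{0,t}$ are notation assumed here for illustration.

```latex
\begin{align*}
\ln(Q_{i,t}^{cigarettes}) &= \beta_{0,t} + \beta_1 \ln(P_{i,t}^{cigarettes}) + \beta_2 \ln(Inc_{i,t}) + \alpha_i + u_{i,t},
  \qquad t = 1985,\ 1995, \\
\ln(Q_{i,1995}^{cigarettes}) - \ln(Q_{i,1985}^{cigarettes})
  &= (\beta_{0,1995} - \beta_{0,1985})
   + \beta_1 \bigl[\ln(P_{i,1995}^{cigarettes}) - \ln(P_{i,1985}^{cigarettes})\bigr] \\
  &\quad + \beta_2 \bigl[\ln(Inc_{i,1995}) - \ln(Inc_{i,1985})\bigr]
   + (u_{i,1995} - u_{i,1985}).
\end{align*}
```

Because $\alpha_i$ does not change over time, it drops out of the differenced equation, so time-invariant state characteristics, such as a historically large tobacco industry, no longer sit in the error term. The difference in the year-specific intercepts becomes the intercept of the differenced regression.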

The time span between the two different years influences how the estimated elas- 
ticities are to be interpreted. Because cigarettes are addictive, changes in price will 
take some time to alter behavior. At first, an increase in the price of cigarettes might 
have little effect on demand. Over time, however, the price increase might contribute 


¹This section assumes knowledge of the material in Sections 10.1 and 10.2 on panel data with T = 2 time
periods.

The Externalities of Smoking


It is often said that smoking creates negative externalities
or costs, such as those of healthcare and
cleaning, which are imposed on third parties by the 
act of smoking. Outright bans on smoking in various
locations (at the workplace or in public locations)
have been suggested or imposed in Western Europe
in recent years, such as in France and England in
2007 and in the Netherlands in 2008. Economists,
however, often object to outright bans and suggest imposing
taxes to correct for these externalities instead.

It is usually suggested that these taxes should 
be imposed at such a level that the external cost is 
reduced to zero, by its burden being shifted onto 
the smoker in this way. We could use econometric 
techniques to estimate this external cost, and subse- 
quently the required tax. 

Such estimation is no simple matter, however. The 
U.K. Government estimates that the smoking-related 
cost to the National Health Service (NHS) in 2015 
was £2.6 billion, but this does not adjust for costs 
that would have been imposed anyway.¹ How much
would it have cost to treat these people for other illnesses
had they not smoked? Are there potentially
other benefits and costs that this misses? If smokers
die young, how do we value the foregone benefit
of their lost life years? What about the value of the
employment that smoking generates?

One recent academic review of available evidence
points to various such potential costs
and benefits but ultimately concludes that the external
costs of smoking “far outweigh any benefits.”²
This suggests that if tax is the lever we wish to use to
change smoking behavior, taxes on tobacco should
rise. We must recognize, however, that the exact
value of these benefits and costs depends on
what we actually consider to be benefits and costs
and can only be estimated. While econometricians
can advise on policy questions such as these, they
will still remain questions of political contention.


¹The data on the cost of smoking to the NHS in England
in 2015 are from an ad hoc statistical publication from July 2017.
The analysis was undertaken by Public Health England
(PHE) to support the development of the new Tobacco
Control Plan for England. For more information, see “Cost
of smoking to the NHS in England: 2015,” on https://www.gov.uk/

²Read the article “The Economic Impact of Smoking and
of Reducing Smoking Prevalence: Review of Evidence,”
by Victor U. Ekpu and Abraham K. Brown, published by
the U.S. National Library of Medicine National Institutes
of Health, https://www.ncbi.nlm.nih.gov, July 14, 2015.

to some smokers’ desire to quit, and, importantly, it could discourage nonsmokers
from taking up the habit. Thus the response of demand to a price increase could be
small in the short run but large in the long run. Said differently, for an addictive product
like cigarettes, demand might be inelastic in the short run (that is, it might have a
short-run elasticity near 0), but it might be more elastic in the long run.

In this analysis, we focus on estimating the long-run price elasticity. We do
this by considering quantity and price changes that occur over 10-year periods.
Specifically, in the regressions considered here, the 10-year change in log
quantity, $\ln(Q_{i,1995}^{cigarettes}) - \ln(Q_{i,1985}^{cigarettes})$, is regressed against the 10-year change in
log price, $\ln(P_{i,1995}^{cigarettes}) - \ln(P_{i,1985}^{cigarettes})$, and the 10-year change in log income,
$\ln(Inc_{i,1995}) - \ln(Inc_{i,1985})$. Two instruments are used: the change in the sales tax
over 10 years, $SalesTax_{i,1995} - SalesTax_{i,1985}$, and the change in the cigarette-specific
tax over 10 years, $CigTax_{i,1995} - CigTax_{i,1985}$.

TABLE 12.1 Two Stage Least Squares Estimates of the Demand for Cigarettes Using
Panel Data for 48 U.S. States

Dependent variable: $\ln(Q_{i,1995}^{cigarettes}) - \ln(Q_{i,1985}^{cigarettes})$

Regressor                                                        (1)              (2)                      (3)
$\ln(P_{i,1995}^{cigarettes}) - \ln(P_{i,1985}^{cigarettes})$    -0.94            -1.34                    -1.20
                                                                 (0.21)           (0.23)                   (0.20)
                                                                 [-1.36, -0.52]   [-1.80, -0.88]           [-1.60, -0.81]
$\ln(Inc_{i,1995}) - \ln(Inc_{i,1985})$                          0.53             0.43                     0.46
                                                                 (0.34)           (0.30)                   (0.31)
                                                                 [-0.16, 1.21]    [-0.16, 1.02]            [-0.16, 1.09]
Intercept                                                        -0.12            -0.02                    -0.05
                                                                 (0.07)           (0.07)                   (0.06)
Instrumental variable(s)                                         Sales tax        Cigarette-specific tax   Both sales tax and cigarette-specific tax
First-stage F-statistic                                          33.7             107.2                    88.6
Overidentifying restrictions J-test and p-value                  -                -                        4.93 (0.026)

These regressions were estimated using data for 48 U.S. states (48 observations on the 10-year differences). The data are
described in Appendix 12.1. The J-test of overidentifying restrictions is described in Key Concept 12.6 (its p-value is given
in parentheses), and the first-stage F-statistic is described in Key Concept 12.5. Heteroskedasticity-robust standard errors
are given in parentheses beneath coefficients, and 95% confidence intervals are given in brackets.



The results are presented in Table 12.1. As usual, each column in the table pres- 
ents the results of a different regression. All regressions have the same regressors, and 
all coefficients are estimated using TSLS; the only difference among the three regres- 
sions is the set of instruments used. In column (1), the only instrument is the sales 
tax; in column (2), the only instrument is the cigarette-specific tax; and in column (3), 
both taxes are used as instruments. 

In IV regression, the reliability of the coefficient estimates hinges on the validity 
of the instruments, so the first things to look at in Table 12.1 are the diagnostic sta- 
tistics assessing the validity of the instruments. 

First, are the instruments relevant? We need to look at the first-stage F-statistics. 
The first-stage regression in column (1) is 

$$\widehat{\ln(P_{i,1995}^{cigarettes}) - \ln(P_{i,1985}^{cigarettes})} = \underset{(0.03)}{0.53} - \underset{(0.22)}{0.22}\,[\ln(Inc_{i,1995}) - \ln(Inc_{i,1985})] + \underset{(0.0044)}{0.0255}\,(SalesTax_{i,1995} - SalesTax_{i,1985}). \tag{12.18}$$

Because there is only one instrument in this regression, the first-stage F-statistic
is the square of the t-statistic testing that the coefficient on the instrumental variable,
$SalesTax_{i,1995} - SalesTax_{i,1985}$, is 0; this is $F = t^2 = (0.0255/0.0044)^2 = 33.7$. For the
regressions in columns (2) and (3), the first-stage F-statistics are 107.2 and 88.6, so in
all three cases the first-stage F-statistics exceed 10. We conclude that the instruments
are not weak, so we can rely on the standard methods for statistical inference (hypothesis
tests and confidence intervals) using the TSLS coefficients and standard errors.

Second, are the instruments exogenous? Because the regressions in columns (1) 
and (2) each have a single instrument and a single included endogenous regressor, 
the coefficients in those regressions are exactly identified. Thus we cannot deploy the 
J-test in either of those regressions. The regression in column (3), however, is overi- 
dentified because there are two instruments and a single included endogenous 
regressor, so there is one ($m - k = 2 - 1 = 1$) overidentifying restriction. The
J-statistic is 4.93; this has a $\chi^2_1$ distribution, so the 5% critical value is 3.84 (Appendix
Table 3) and the null hypothesis that both the instruments are exogenous is rejected
at the 5% significance level (this deduction also can be made directly from the 
p-value of 0.026, reported in the table). 
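
As a rough check on these numbers, the critical value and p-value for a chi-squared distribution with one degree of freedom can be computed from the standard normal distribution, since a $\chi^2_1$ variable is the square of a standard normal variable. The sketch below uses only the Python standard library; the function names are illustrative and are not part of the text.

```python
import math
from statistics import NormalDist


def chi2_1_critical_value(alpha: float) -> float:
    """Critical value of a chi-squared(1) variable at significance level alpha.

    Uses the fact that chi-squared(1) is the square of a standard normal,
    so the critical value is the squared two-sided normal critical value.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2


def chi2_1_p_value(stat: float) -> float:
    """P(chi-squared(1) > stat) = P(|Z| > sqrt(stat)) for standard normal Z."""
    return math.erfc(math.sqrt(stat / 2))


j_stat = 4.93                              # J-statistic reported in column (3)
crit_5pct = chi2_1_critical_value(0.05)    # approximately 3.84
p_value = chi2_1_p_value(j_stat)           # approximately 0.026
reject = j_stat > crit_5pct
print(round(crit_5pct, 2), round(p_value, 3), reject)
```

This shortcut works only for one overidentifying restriction; with $m - k > 1$ restrictions the J-statistic has a $\chi^2_{m-k}$ distribution and a general chi-squared routine is needed.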

The reason the J-statistic rejects the null hypothesis that both instruments are 
exogenous is that the two instruments produce rather different estimated coefficients. 
When the only instrument is the sales tax [column (1)], the estimated price elasticity
is −0.94, but when the only instrument is the cigarette-specific tax, the estimated price
elasticity is −1.34. Recall the basic idea of the J-statistic: If both instruments are
exogenous, then the two TSLS estimators using the individual instruments are consistent
and differ from each other only because of random sampling variation. If, however, 
one of the instruments is exogenous and one is not, then the estimator based on the 
endogenous instrument is inconsistent, which is detected by the J-statistic. In this 
application, the difference between the two estimated price elasticities is sufficiently 
large that it is unlikely to be the result of pure sampling variation, so the J-statistic 
rejects the null hypothesis that both the instruments are exogenous. 

The J-statistic rejection means that the regression in column (3) is based on 
invalid instruments (the instrument exogeneity condition fails). What does this imply 
about the estimates in columns (1) and (2)? The J-statistic rejection says that at least 
one of the instruments is endogenous, so there are three logical possibilities: The sales 
tax is exogenous but the cigarette-specific tax is not, in which case the column (1) 
regression is reliable; the cigarette-specific tax is exogenous but the sales tax is not, 
so the column (2) regression is reliable; or neither tax is exogenous, so neither regres- 
sion is reliable. The statistical evidence cannot tell us which possibility is correct, so 
we must use our judgment. 

We think that the case for the exogeneity of the general sales tax is stronger than 
that for the cigarette-specific tax because the political process can link changes in the 
cigarette-specific tax to changes in the cigarette market and smoking policy. For 
example, if smoking decreases in a state because it falls out of fashion, there will be 
fewer smokers and a weakened lobby against cigarette-specific tax increases, which 
in turn could lead to higher cigarette-specific taxes. Thus changes in tastes (which are 
part of u) could be correlated with changes in cigarette-specific taxes (the instru- 
ment). This suggests discounting the IV estimates that use the cigarette-only tax as 
an instrument and adopting the price elasticity estimated using the general sales tax 
as an instrument, −0.94.

The estimate of −0.94 indicates that cigarette consumption is somewhat elastic:
An increase in price of 1% leads to a decrease in consumption of 0.94%. This may 
seem surprising for an addictive product like cigarettes. But remember that this elas- 
ticity is computed using changes over a 10-year period, so it is a long-run elasticity. 
This estimate suggests that increased taxes can make a substantial dent in cigarette 
consumption, at least in the long run. 
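
To see what an elasticity of −0.94 implies for larger price changes, the following sketch converts a percentage price change into the implied percentage change in consumption under a constant-elasticity (log-log) demand curve. The function name and the 10% scenario are illustrative assumptions, not figures from the text.

```python
import math


def pct_change_in_quantity(elasticity: float, pct_price_change: float) -> float:
    """Percentage change in quantity implied by a log-log demand curve.

    The change in ln(Q) equals the elasticity times the change in ln(P).
    """
    dlog_price = math.log(1 + pct_price_change / 100)
    dlog_quantity = elasticity * dlog_price
    return 100 * (math.exp(dlog_quantity) - 1)


long_run_elasticity = -0.94  # estimate using the general sales tax as instrument

small = pct_change_in_quantity(long_run_elasticity, 1.0)    # about -0.93
large = pct_change_in_quantity(long_run_elasticity, 10.0)   # about -8.57
print(round(small, 2), round(large, 2))
```

For small price changes the result is close to the elasticity itself; for larger changes the exact log-based calculation differs slightly from simply multiplying the elasticity by the percentage change.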

When the elasticity is estimated using 5-year changes from 1985 to 1990 rather 
than the 10-year changes reported in Table 12.1, the elasticity (estimated with the 
general sales tax as the instrument) is −0.79; for changes from 1990 to 1995, the
elasticity is −0.68. These estimates suggest that demand is less elastic over horizons
of 5 years than over 10 years. This finding of greater price elasticity at longer horizons 
is consistent with the large body of research on cigarette demand. Demand elasticity 
estimates in that literature typically fall in the range −0.3 to −0.5, but these are
mainly short-run elasticities; some studies suggest that the long-run elasticity could 
be perhaps twice the short-run elasticity. 


Where Do Valid Instruments Come From? 


In practice, the most difficult aspect of IV estimation is finding instruments that are 
both relevant and exogenous. There are two main approaches, which reflect two dif- 
ferent perspectives on econometric and statistical modeling. 

The first approach is to use economic theory to suggest instruments. For exam- 
ple, Philip Wright’s understanding of the economics of agricultural markets led him 
to look for an instrument that shifted the supply curve but not the demand curve; this 
in turn led him to consider weather conditions in agricultural regions. One area 
where this approach has been particularly successful is the field of financial econom- 
ics. Some economic models of investor behavior involve statements about how inves- 
tors forecast, which then imply sets of variables that are uncorrelated with the error 
term. Those models sometimes are nonlinear in the data and in the parameters, in 
which case the IV estimators discussed in this chapter cannot be used. An extension 
of IV methods to nonlinear models, called generalized method of moments estima- 
tion, is used instead. Economic theories are, however, abstractions that often do not 
take into account the nuances and details necessary for analyzing a particular data 
set. Thus this approach does not always work. 

The second approach to constructing instruments is to look for some exogenous 
source of variation in X arising from what is, in effect, a random phenomenon that 


²A sobering economic study by Adda and Cornaglia (2006) suggests that smokers compensate for higher
taxes by smoking more intensively, thus extracting more nicotine per cigarette. If you are interested in 
learning more about the economics of smoking, see Chaloupka and Warner (2000), Gruber (2001), and 
Carpenter and Cook (2008). 
induces shifts in the endogenous regressor. For example, in our hypothetical example
in Section 12.1, earthquake damage increased average class size in some school dis- 
tricts, and this variation in class size was unrelated to potential omitted variables that 
affect student achievement. This approach typically requires knowledge of the prob- 
lem being studied and careful attention to the details of the data, and it is best 
explained through examples. 


Three Examples 


We now turn to three empirical applications of IV regression that illustrate how dif- 
ferent researchers used their expert knowledge of their empirical problem to find 
instrumental variables. 


Do economic institutions affect economic development? The question that has most
troubled economists since Adam Smith is why some nations are rich while
others remain poor. Unpicking the various mechanisms that lead to economic growth
and evaluating the contribution of each mechanism requires a combination of theory
and empirical analysis. However, such empirical analysis is not as straightforward as it
seems. Consider, for example, the role played by institutions, such as legal institutions
that facilitate the ownership of property. It is quite plausible that strong institutions that
foster property rights could lead to higher economic growth if they incentivized a more
efficient use of scarce resources.

Disentangling this particular issue is challenging, precisely because economic 
institutions and economic growth are so interconnected. This means that a simple 
regression of some measure of economic development (GDP per capita) against a 
measure of institutions, such as protection against expropriation (the strength of the 
property rights in a country), will yield a biased estimate of the causal effect of insti- 
tutions on economic development even if the analyst controls for a number of other 
factors affecting economic development, such as whether a country is landlocked or 
not. This results from the serious potential for simultaneous causality bias in this 
analysis: Stronger institutions can lead to greater economic development. Conversely, 
however, economic growth could enable the creation of these kinds of institutions 
and institutional arrangements. As a result there is a “chicken and egg” situation 
where it is not clear which comes first. As in the butter example in Figure 12.1, 
because of this simultaneous causality, an OLS regression of economic development 
on a measure of institutions will estimate some complicated combination of these 
two effects. This problem cannot be solved by finding better control variables. 

This simultaneous causality bias, however, can be eliminated by finding a suit- 
able instrumental variable and using TSLS. The instrument must be correlated with 
the measure of institutions (it must be relevant), but it must also be uncorrelated 
with the error term in the economic development equation of interest (it must be 
exogenous). That is, it must affect the measure of institutions but be unrelated to
any of the unobserved factors that determine economic development. 
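
The logic of this argument can be illustrated with a small simulation. Everything in the sketch below is hypothetical: the data-generating process, the true coefficient of 2.0, and the variable names are assumptions chosen only to show that OLS is inconsistent when the regressor is correlated with the error, whereas the IV estimator with one valid instrument (the ratio of the sample covariance of Z and Y to that of Z and X) is not.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


random.seed(12)
n = 20_000
beta1 = 2.0  # hypothetical true causal effect of X on Y

z = [random.gauss(0, 1) for _ in range(n)]  # instrument: relevant and exogenous
u = [random.gauss(0, 1) for _ in range(n)]  # error term in the equation of interest
v = [random.gauss(0, 1) for _ in range(n)]  # other variation in X

# X depends on u, so X is endogenous (this mimics feedback from Y to X).
x = [zi + 0.8 * ui + vi for zi, ui, vi in zip(z, u, v)]
y = [beta1 * xi + ui for xi, ui in zip(x, u)]

beta_ols = covariance(x, y) / covariance(x, x)  # inconsistent: picks up corr(X, u)
beta_iv = covariance(z, y) / covariance(z, x)   # TSLS with a single instrument
print(round(beta_ols, 2), round(beta_iv, 2))
```

In this design the OLS estimate settles well above 2.0 no matter how large the sample, while the IV estimate is close to 2.0. No set of control variables could fix the OLS estimate here, because the source of the bias is the dependence of X on the error term itself.
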
Things that might affect the ability to have strong economic institutions are very
likely to be related to the economic performance of a country. So where does one 
find something that affects institutions but has no direct effect on economic develop- 
ment? Because it takes a long time for institutions to become established, one idea 
is to consider the history of how economic institutions were first developed. Plausibly 
there may be factors from hundreds of years ago that were relevant in the initial 
founding of institutions, but are not related to the level of economic development 
today except through their impact on institutions. Specifically, Acemoglu et al.
(2001) consider the colonial origins of economic institutions. They argue that the 
potential mortality rate among settlers was influential in determining whether Euro- 
pean countries established “Neo-Europes” involving setting up European-style insti- 
tutions that protected private property rights or instead set up “extractive states.” 
They further argue that these differences in institutions persist to the present day. 

Are measures of potential settler mortality valid instruments? Although Acemoglu
et al. did not report first-stage F-statistics, settler mortality alone was found to
explain 27% of the variation in current institutions, suggesting that this instrument is
relevant. The argument that the instruments are exogenous requires that settler
mortality affects economic development only through its effect on institutions. As a
robustness check, to investigate whether settler mortality may have been caused by 
diseases that still exist and that may hamper economic performance today, Acemoglu 
et al. include prevalence of malaria in their regression. They find that the inclusion 
of this regressor makes little difference to the resulting regression coefficients. In 
addition, because Acemoglu et al. break down the causal pathway through which 
settler mortality affects current institutions into three parts, there are three instru- 
ments and, therefore, overidentifying restrictions can be tested. The failure to reject 
the null hypotheses of these tests bolsters the case that the instruments are valid. 

Using these instruments and TSLS, Acemoglu et al. estimated the effect of institutions
on economic development to be substantial. This estimated effect was twice
as large as the effect estimated using OLS, suggesting that OLS suffered from large
simultaneous causality bias. In addition, they find that in the TSLS model neither the
coefficient on the dummy for Africa nor the coefficient on a country’s distance from the
equator is statistically significant, suggesting that “Africa is poorer than the rest of the
world not because of pure geographic or cultural factors, but because of worse institutions.”⁴


Does cutting class sizes increase test scores? As we saw in the empirical analysis of 
Part II, schools with small classes tend to be wealthier, and their students have access 
to enhanced learning opportunities both in and out of the classroom. In Part II, we


³For further reading, see Daron Acemoglu, Simon Johnson, and James A. Robinson, “The Colonial
Origins of Comparative Development: An Empirical Investigation,” American Economic Review,
Vol. 91, No. 5, December 2001.


⁴If you are interested in learning more about this empirical analysis and the response to it by other
economists, see the original paper, Acemoglu et al. (2001); the comment on it by Albouy (2012); and the
reply to the comment, Acemoglu et al. (2012).
used multiple regression to tackle the threat of omitted variables bias by controlling
for various measures of student affluence, ability to speak English, and so forth. Still, 
a skeptic could wonder whether we did enough: If we left out something important, 
our estimates of the class size effect would still be biased. 

This potential omitted variables bias could be addressed by including the right 
control variables, but if these data are unavailable (some, like outside learning oppor- 
tunities, are hard to measure), then an alternative approach is to use IV regression. 
This regression requires an instrumental variable correlated with class size (rele- 
vance) but uncorrelated with the omitted determinants of test performance that 
make up the error term, such as parental interest in learning, learning opportunities 
outside the classroom, quality of the teachers and school facilities, and so forth 
(exogeneity). 

Where does one look for an instrument that induces random, exogenous varia- 
tion in class size, but is unrelated to the other determinants of test performance? 
Hoxby (2000) suggested biology. Because of random fluctuations in the timing of births,
the size of the incoming kindergarten class varies from one year to the next. Although 
the actual number of children entering kindergarten might be endogenous (recent 
news about the school might influence whether parents send a child to a private 
school), she argued that the potential number of children entering kindergarten—the 
number of 4-year-olds in the district—is mainly a matter of random fluctuations in 
the birth dates of children. 

Is potential enrollment a valid instrument? Whether it is exogenous depends on 
whether it is correlated with unobserved determinants of test performance. Surely 
biological fluctuations in potential enrollment are exogenous, but potential enroll- 
ment also fluctuates because parents with young children choose to move into an 
improving school district and out of one in trouble. If so, an increase in potential 
enrollment could be correlated with unobserved factors such as the quality of school 
management, rendering this instrument invalid. Hoxby addressed this problem by 
reasoning that growth or decline in the potential student pool for this reason would 
occur smoothly over several years, whereas random fluctuations in birth dates would 
produce short-term “spikes” in potential enrollment. Thus she used as her instrument 
not potential enrollment, but the deviation of potential enrollment from its long- 
term trend. These deviations satisfy the criterion for instrument relevance (the first- 
stage F-statistics all exceed 100). She makes a good case that this instrument is 
exogenous, but, as in all IV analysis, the credibility of this assumption is ultimately a 
matter of judgment. 
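
The idea of using deviations from trend as the instrument can be sketched as follows. This is a minimal illustration, not Hoxby's procedure: the text does not say how the long-term trend was fitted, so the linear trend, the function name, and the enrollment figures below are all assumptions made for the example.

```python
def deviations_from_linear_trend(series):
    """Residuals from an OLS regression of the series on a constant and a time index."""
    n = len(series)
    t = list(range(n))
    t_bar = sum(t) / n
    y_bar = sum(series) / n
    slope = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, series))
             / sum((ti - t_bar) ** 2 for ti in t))
    intercept = y_bar - slope * t_bar
    return [yi - (intercept + slope * ti) for ti, yi in zip(t, series)]


# Hypothetical potential enrollment for one district: steady growth of
# 2 children per year plus a one-time spike in the fourth year (index 3).
potential_enrollment = [100, 102, 104, 116, 108, 110]

# Candidate instrument: the part of potential enrollment not explained by the trend.
instrument = deviations_from_linear_trend(potential_enrollment)
print([round(d, 2) for d in instrument])
```

The smooth growth is absorbed by the fitted trend, so the largest deviation occurs in the spike year. The spike is the kind of short-term variation that is argued to come from random fluctuations in birth dates rather than from families moving in response to school quality.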

Hoxby implemented this strategy using detailed panel data on elementary 
schools in Connecticut in the 1980s and 1990s. The panel data set permitted her to 
include school fixed effects, which, in addition to the instrumental variables strategy, 
attack the problem of omitted variables bias at the school level. Her TSLS estimates 
suggested that the effect of class size on test scores is small; most of her estimates
were not statistically significantly different from 0.
Does aggressive treatment of heart attacks prolong lives? Aggressive treatments
for victims of heart attacks (technically, acute myocardial infarctions, or AMIs) hold
the potential for saving lives. Before a new medical procedure (in this example,
cardiac catheterization⁵) is approved for general use, it goes through clinical trials,
a series of randomized controlled experiments designed to measure its effects and 
side effects. But strong performance in a clinical trial is one thing; actual performance 
in the real world is another. 

A natural starting point for estimating the real-world effect of cardiac catheter- 
ization is to compare patients who received the treatment to those who did not. This 
leads to regressing the length of survival of the patient against the binary treatment 
variable (whether the patient received cardiac catheterization) and other control 
variables that affect mortality (age, weight, other measured health conditions, and so 
forth). The population coefficient on the indicator variable is the increment to the 
patient’s life expectancy provided by the treatment. Unfortunately, the OLS estima- 
tor is subject to bias: Cardiac catheterization does not “just happen” to a patient 
randomly; rather, it is performed because the doctor and patient decide that it might 
be effective. If their decision is based in part on unobserved factors relevant to health 
outcomes not in the data set, the treatment decision will be correlated with the 
regression error term. If the healthiest patients are the ones who receive the treat- 
ment, the OLS estimator will be biased (treatment is correlated with an omitted 
variable), and the treatment will appear more effective than it really is. 

This potential bias can be eliminated by IV regression using a valid instrumental 
variable. The instrument must be correlated with treatment (must be relevant) but 
must be uncorrelated with the omitted health factors that affect survival (must be 
exogenous). 

Where does one look for something that affects treatment but does not affect the 
health outcome other than through its effect on treatment? McClellan, McNeil, and 
Newhouse (1994) suggested geography. Most hospitals in their data set did not offer 
cardiac catheterization, so many patients were closer to “regular” hospitals that did 
not offer this treatment than to cardiac catheterization hospitals. McClellan, McNeil, 
and Newhouse therefore used as an instrumental variable the difference between the 
distance from the AMI patient’s home to the nearest cardiac catheterization hospital 
and the distance to the nearest hospital of any sort; this distance is 0 if the nearest 
hospital is a cardiac catheterization hospital, and otherwise it is positive. If this rela- 
tive distance affects the probability of receiving this treatment, then it is relevant. If 
it is distributed randomly across AMI victims, then it is exogenous. 
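
The construction of this instrument is simple enough to write down directly. The sketch below is only an illustration of the definition in the text; the function name, the units (miles), and the three patients are hypothetical.

```python
def differential_distance(dist_nearest_cath: float, dist_nearest_any: float) -> float:
    """Extra distance to the nearest catheterization hospital beyond the
    distance to the nearest hospital of any kind."""
    if dist_nearest_cath < dist_nearest_any:
        # A catheterization hospital is itself a hospital, so it cannot be
        # closer than the nearest hospital of any kind.
        raise ValueError("inconsistent distances")
    return dist_nearest_cath - dist_nearest_any


# Hypothetical patients: (miles to nearest catheterization hospital,
#                         miles to nearest hospital of any kind)
patients = [(3.0, 3.0), (12.5, 4.0), (40.0, 7.5)]

relative_distance = [differential_distance(c, a) for c, a in patients]
print(relative_distance)  # 0 for the first patient, positive for the others
```

Measuring distance relative to the nearest hospital of any kind, rather than using raw distance, removes the part of distance that simply reflects how remote the patient's home is.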

Is relative distance to the nearest cardiac catheterization hospital a valid instru- 
ment? McClellan, McNeil, and Newhouse do not report first-stage F-statistics, but they 
do provide other empirical evidence that it is not weak. Is this distance measure exog- 
enous? They make two arguments. First, they draw on their medical expertise and 


⁵Cardiac catheterization is a procedure in which a catheter, or tube, is inserted into a blood vessel and
guided all the way to the heart to obtain information about the heart and coronary arteries. 
knowledge of the health care system to argue that distance to a hospital is plausibly
uncorrelated with any of the unobservable variables that determine AMI outcomes. 
Second, they have data on some of the additional variables that affect AMI outcomes, 
such as the weight of the patient, and in their sample, distance is uncorrelated with 
these observable determinants of survival; this, they argue, makes it more credible that 
distance is uncorrelated with the unobservable determinants in the error term as well. 

Using 205,021 observations on Americans aged at least 64 who had an AMI in 
1987, McClellan, McNeil, and Newhouse reached a striking conclusion: Their TSLS 
estimates suggest that cardiac catheterization has a small, possibly 0, effect on health 
outcomes; that is, cardiac catheterization does not substantially prolong life. In con- 
trast, the OLS estimates suggest a large positive effect. They interpret this difference 
as evidence of bias in the OLS estimates. 

McClellan, McNeil, and Newhouse’s IV method has an interesting interpretation. 
The OLS analysis used actual treatment as the regressor, but because actual treat- 
ment is itself the outcome of a decision by patient and doctor, they argue that the 
actual treatment is correlated with the error term. Instead, TSLS uses predicted treat- 
ment, where the variation in predicted treatment arises because of variation in the 
instrumental variable: Patients closer to a cardiac catheterization hospital are more 
likely to receive this treatment. 

This interpretation has two implications. First, the IV regression actually esti- 
mates the effect of the treatment not on a “typical” randomly selected patient but 
rather on patients for whom distance is an important consideration in the treatment 
decision. The effect on those patients might differ from the effect on a typical patient, 
which provides one explanation of the greater estimated effectiveness of the treat- 
ment in clinical trials than in McClellan, McNeil, and Newhouse’s IV study. Second, 
it suggests a general strategy for finding instruments in this type of setting: Find an 
instrument that affects the probability of treatment, but does so for reasons that are 
unrelated to the outcome except through their effect on the likelihood of treatment. 
Both these implications have applicability to experimental and “quasi-experimental” 
studies, the topic of Chapter 13. 
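
The first implication can be illustrated with a simulation in which the treatment effect differs across patients. The setup is entirely hypothetical: the three patient types, their shares, and their treatment effects are assumptions chosen so that the point is easy to see. With a binary instrument, the IV estimate is the difference in mean outcomes divided by the difference in treatment rates between the two instrument groups.

```python
import random

random.seed(13)
n = 40_000
y_by_z = {0: [], 1: []}  # outcomes grouped by instrument value
d_by_z = {0: [], 1: []}  # treatment indicators grouped by instrument value

for _ in range(n):
    z = random.randint(0, 1)     # 1 = lives close to a catheterization hospital
    r = random.random()
    if r < 0.5:                  # treated only when close: distance matters
        d, effect = z, 1.0
    elif r < 0.75:               # always treated, regardless of distance
        d, effect = 1, 5.0
    else:                        # never treated, regardless of distance
        d, effect = 0, 3.0
    y = effect * d + random.gauss(0, 1)  # outcome, e.g., added years of survival
    y_by_z[z].append(y)
    d_by_z[z].append(d)


def mean(values):
    return sum(values) / len(values)


# IV estimate with a binary instrument: difference in mean outcomes divided
# by the difference in treatment rates.
iv_estimate = ((mean(y_by_z[1]) - mean(y_by_z[0]))
               / (mean(d_by_z[1]) - mean(d_by_z[0])))

# Average effect over the whole (hypothetical) population, for comparison.
average_effect = 0.5 * 1.0 + 0.25 * 5.0 + 0.25 * 3.0
print(round(iv_estimate, 2), average_effect)
```

In this design the IV estimate is close to 1.0, the effect for patients whose treatment depends on distance, rather than 2.5, the average effect across all patients.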


Conclusion 


From the humble start of estimating how much less butter people will buy if its price 
rises, IV methods have evolved into a general approach for estimating regressions 
when one or more variables are correlated with the error term. Instrumental vari- 
ables regression uses the instruments to isolate variation in the endogenous regres- 
sors that is uncorrelated with the error in the regression of interest; this is the first 
stage of two stage least squares. This in turn permits estimation of the effect of inter- 
est in the second stage of two stage least squares. 

Successful IV regression requires valid instruments — that is, instruments that are 
both relevant (not weak) and exogenous. If the instruments are weak, then the TSLS 
estimator can be biased, even in large samples, and statistical inferences based on
TSLS t-statistics and confidence intervals can be misleading. Fortunately, when there
is a single endogenous regressor, it is possible to check for weak instruments simply 
by checking the first-stage F-statistic. 

If the instruments are not exogenous — that is, if one or more instruments are 
correlated with the error term—the TSLS estimator is inconsistent. If there are more 
instruments than endogenous regressors, instrument exogeneity can be examined by 
using the J-statistic to test the overidentifying restrictions. However, the core
assumption, that there are at least as many exogenous instruments as there are endogenous
regressors, cannot be tested. It is therefore incumbent on both the empirical analyst
and the critical reader to use their own understanding of the empirical application to 
evaluate whether this assumption is reasonable. 

The interpretation of IV regression as a way to exploit known exogenous varia- 
tion in the endogenous regressor can be used to guide the search for potential instru- 
mental variables in a particular application. This interpretation underlies much of the 
empirical analysis in the area that goes under the broad heading of program evalua- 
tion, in which experiments or quasi-experiments are used to estimate the effect of 
programs, policies, or other interventions on some outcome measure. A variety of 
additional issues arises in those applications—for example, the interpretation of IV 
results when, as in the cardiac catheterization example, different “patients” might 
have different responses to the same “treatment.” These and other aspects of empiri- 
cal program evaluation are taken up in Chapter 13. 


Summary 


1. Instrumental variables regression is a way to estimate causal coefficients when 
one or more regressors are correlated with the error term. 

2. Endogenous variables are correlated with the error term in the equation of 
interest; exogenous variables are uncorrelated with this error term. 

3. For an instrument to be valid, it must be (1) correlated with the included 
endogenous variable and (2) exogenous. 

4. IV regression requires at least as many instruments as included endogenous 
variables. 

5. The TSLS estimator has two stages. First, the included endogenous variables 
are regressed against the included exogenous variables and the instruments. 
Second, the dependent variable is regressed against the included exogenous 
variables and the predicted values of the included endogenous variables from 
the first-stage regression(s). 

6. Weak instruments (instruments that are nearly uncorrelated with the included 
endogenous variables) make the TSLS estimator biased and TSLS confidence 
intervals and hypothesis tests unreliable. 

7. If an instrument is not exogenous, the TSLS estimator is inconsistent.
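
The two stages described in item 5 can be sketched for the simplest case of one endogenous regressor, one instrument, and no included exogenous variables. The data and the helper function below are hypothetical and serve only to make the mechanics concrete.

```python
def ols(x, y):
    """Intercept and slope from an OLS regression of y on a constant and x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Small hypothetical data set: instrument z, endogenous regressor x, outcome y.
z = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.1, 1.9, 3.2, 3.8, 5.3, 5.9]
y = [2.0, 4.1, 5.7, 8.4, 9.6, 12.5]

# First stage: regress X on Z and form the predicted values of X.
pi0, pi1 = ols(z, x)
x_hat = [pi0 + pi1 * zi for zi in z]

# Second stage: regress Y on the predicted values of X.
beta0_tsls, beta1_tsls = ols(x_hat, y)
print(round(beta1_tsls, 4))
```

In this single-instrument case the second-stage slope equals the ratio of the sample covariance between Z and Y to the sample covariance between Z and X. The standard errors printed by an ordinary second-stage OLS regression are not valid for TSLS, because they ignore the fact that the regressor was itself estimated in the first stage, so dedicated TSLS software should be used for inference.
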
Key Terms

instrumental variables (IV) regression (427)
instrumental variable (instrument) (427)
endogenous variable (428)
exogenous variable (428)
instrument relevance condition (429)
instrument exogeneity condition (429)
two stage least squares (429)
included exogenous variables (437)
exactly identified (438)
overidentified (438)
underidentified (438)
reduced form (439)
first-stage regression (440)
second-stage regression (440)
weak instruments (445)
first-stage F-statistic (446)
test of overidentifying restrictions (448)

Review the Concepts


12.1 In the demand curve model of Equation (12.3), is $\ln(P_i^{butter})$ positively or
negatively correlated with the error, $u_i$? If $\beta_1$ is estimated by OLS, would you
expect the estimated value to be larger or smaller than the true value of $\beta_1$?
Explain.

12.2 Describe the key characteristics of a valid instrument. If you were a researcher,
how would you determine if the variable you have selected for an endogenous
regressor is a valid instrument or not?

12.3 In their study of the effect of institutions on economic development, suppose
Acemoglu et al. had used the prevalence of malaria as an instrument.
Would this instrument be relevant? Would it be exogenous? Would it be a
valid instrument?

12.4 In their study of the effectiveness of cardiac catheterization, McClellan,
McNeil, and Newhouse (1994) used as an instrument the difference in distances
to cardiac catheterization and regular hospitals. How could you determine
whether this instrument is relevant? How could you determine whether
this instrument is exogenous?

Exercises


12.1 This question refers to the panel data IV regressions summarized in Table 12.1.


a. Suppose the federal government is considering a new tax on cigarettes 
that is estimated to increase the retail price by $0.25 per pack. If the cur- 
rent price per pack is $6.75, use the IV regression in column (1) to pre- 
dict the change in demand. Construct a 95% confidence interval for the 
change in demand. 

b. Suppose the United States enters a recession, and income falls by 5%. 
Use the regression in column (1) to predict the change in demand. 

c. Suppose you have additional data on the prices and quantities of cigarettes 
in 1993, 1994, 1996, and 1997. How do you think the estimated coefficients
would change with an eight-year horizon? With a twelve-year horizon? 

d. Suppose that the F-statistic in column (1) were 63.7 instead of 33.7. 
Would the regression provide a reliable answer to the question posed in 
(a)? Why or why not? 

12.2 Consider the regression model with a single regressor: $Y_i = \beta_0 + \beta_1 X_i + u_i$.
Suppose the least squares assumptions in Key Concept 4.3 are satisfied.

a. Show that $X_i$ is a valid instrument. That is, show that Key Concept 12.3 is
satisfied with $Z_i = X_i$.

b. Show that the IV regression assumptions in Key Concept 12.4 are
satisfied with this choice of $Z_i$.

c. Show that the IV estimator constructed using $Z_i = X_i$ is identical to the
OLS estimator.

12.3 A classmate is interested in estimating the variance of the error term in
Equation (12.1).

a. Suppose she uses the estimator from the second-stage regression of
TSLS: $\hat{\sigma}_a^2 = \frac{1}{n-2}\sum_{i=1}^{n}(Y_i - \hat{\beta}_0^{TSLS} - \hat{\beta}_1^{TSLS}\hat{X}_i)^2$, where $\hat{X}_i$ is the fitted
value from the first-stage regression. Is this estimator consistent? (For
the purposes of this question, suppose that the sample is very large and
the TSLS estimators are essentially identical to $\beta_0$ and $\beta_1$.)

b. Is $\hat{\sigma}_b^2 = \frac{1}{n-2}\sum_{i=1}^{n}(Y_i - \hat{\beta}_0^{TSLS} - \hat{\beta}_1^{TSLS}X_i)^2$ consistent?


12.4 Consider TSLS estimation of the effect of a single included endogenous variable,
$X_i$, on $Y_i$ using one binary instrument, $Z_i$, which takes values of either 0
or 1. Noting that $\sum_{i=1}^{n}(Y_i - \bar{Y})(Z_i - \bar{Z}) = \sum_{i=1}^{n} Z_i(Y_i - \bar{Y})$, show that the
Wald estimator can be derived from the TSLS estimator in this circumstance
to estimate the effect of $X_i$ on $Y_i$: $\hat{\beta}^{Wald} = (\bar{Y}_{Z=1} - \bar{Y}_{Z=0})/(\bar{X}_{Z=1} - \bar{X}_{Z=0})$,
where $\bar{Y}_{Z=1}$ equals the mean of values of $Y_i$ for which $Z_i = 1$.


12.5 Consider the IV regression model

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i,$$

where $X_i$ is correlated with $u_i$ and $Z_i$ is an instrument. Suppose that the first
three assumptions in Key Concept 12.4 are satisfied. Which IV assumption is
not satisfied when

a. $Z_i$ is independent of $(Y_i, X_i, W_i)$?
b. $Z_i = W_i$?
c. $W_i = 1$ for all $i$?
d. $Z_i = X_i$?


12.6 Suppose a researcher is considering developing an IV regression model with
one regressor, $X_i$, and one instrument, $Z_i$. If she has a sample of $n = 113$, what
range must the correlation coefficient be between $X_i$ and $Z_i$ in order for $Z_i$ to
be considered a strong instrument? [Hint: See Equation (7.14).]


12.7 A classmate has developed an IV regression model with one regressor, $X_i$,
and two instruments, $Z_{1i}$ and $Z_{2i}$. She has a strong theoretical basis as to why
$\mathrm{corr}(Z_{1i}, u_i) = 0$, namely that $Z_{1i}$ is the result of a random lottery. Preliminary
work, however, showed that the first-stage F-statistic from this exactly
identified model was insufficiently large for $Z_{1i}$ to be considered a relevant
instrument set by itself. As a result she includes an additional instrument, $Z_{2i}$,
which is strongly relevant but is less likely to satisfy the condition of instrument
exogeneity. In the instrumental variable regression model with one regressor,
$X_i$, and two instruments, $Z_{1i}$ and $Z_{2i}$, the value of the J-statistic is $J = 7.5$.

a. Does this suggest that $E(u_i \mid Z_{1i}, Z_{2i}) \neq 0$? Explain.
b. Does this suggest that $E(u_i \mid Z_{2i}) \neq 0$? Explain.


Consider a product market with a supply function $Q_i^s = \beta_0 + \beta_1 P_i + u_i^s$,
a demand function $Q_i^d = \gamma_0 + u_i^d$, and a market equilibrium condition
$Q_i^s = Q_i^d$, where $u_i^s$ and $u_i^d$ are mutually independent i.i.d. random variables,
both with a mean of 0.

a. Show that $P_i$ and $u_i^s$ are correlated.
b. Show that the OLS estimator of $\beta_1$ is inconsistent.
c. How would you estimate $\beta_0$, $\beta_1$, and $\gamma_0$?
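A simulation can make the simultaneity problem in this market concrete. The sketch below is illustrative only: the parameter values, the normal errors, and the function names are assumptions, not part of the exercise.

```python
import random


def sample_cov(a, b):
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (n - 1)


def simulate_market(n, beta0=1.0, beta1=2.0, gamma0=5.0, sd_s=1.0, sd_d=1.0, seed=0):
    """Draw equilibrium prices and quantities for the two-equation market."""
    rng = random.Random(seed)
    prices, quantities, supply_errors = [], [], []
    for _ in range(n):
        u_s = rng.gauss(0, sd_s)
        u_d = rng.gauss(0, sd_d)
        q = gamma0 + u_d               # demand pins down the quantity
        p = (q - beta0 - u_s) / beta1  # price at which supply equals demand
        prices.append(p)
        quantities.append(q)
        supply_errors.append(u_s)
    return prices, quantities, supply_errors


prices, quantities, supply_errors = simulate_market(20000)
cov_p_us = sample_cov(prices, supply_errors)
ols_slope = sample_cov(prices, quantities) / sample_cov(prices, prices)
print(cov_p_us, ols_slope)
```

With these assumed values the sample covariance between price and the supply error is clearly negative, and the OLS slope settles near 1 rather than near the assumed $\beta_1 = 2$, even with 20,000 observations.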


A researcher is interested in the effect of more secure property rights on 
income across countries. He collects recent data from 60 countries and runs 
the OLS regression $Y_i = \beta_0 + \beta_1 X_i + u_i$, where $Y_i$ is a country’s GDP per
capita and $X_i$ is an index taking values between 0 and 10 reflecting the protection
against expropriation, where a higher value indicates greater protection
against expropriation, that is, more secure property rights.


a. Explain why the OLS estimates are likely to be unreliable and indicate 
in which direction they might be biased. (Hint: In which direction does 
causality run in this example?) 
b. All of the countries in the researcher’s sample were former colonies. Institutions
securing property rights could originate from early institutions
established alongside European settlements. The decision for Europeans 
to settle or otherwise could reflect concerns for mortality among settlers. 
Explain how settler mortality might be used as an instrument to estimate 
the effect of more secure property rights on income across countries. 


12.10 Two classmates are comparing their answers to an assignment. One
classmate has specified an instrumental variable regression model
$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i$ using $Z_i$ as an instrument. The other student has
specified the same model, but has omitted $W_i$.

a. The first student says that if $Z_i$ and $W_i$ are correlated, then the second
student’s IV estimator will not be consistent. Is the first student right
about this?

b. The second student argues that if in the true model $\beta_2 = 0$, then their IV
estimator will be consistent. Is the second student correct in saying this?


Empirical Exercises 


E12.1 How does fertility affect labor supply? That is, how much does a woman’s
labor supply fall when she has an additional child? In this exercise, you will
estimate this effect using data for married women from the 1980 U.S. Census.⁶
The data are available on the text website, http://www.pearsonglobaleditions.com,
in the file Fertility and described in the file Fertility_Description. The
data set contains information on married women aged 21-35 with two or
more children.

a. Regress weeksworked on the indicator variable morekids, using OLS. On
average, do women with more than two children work less than women
with two children? How much less?

b. Explain why the OLS regression estimated in (a) is inappropriate for
estimating the causal effect of fertility (morekids) on labor supply
(weeksworked).

c. The data set contains the variable samesex, which is equal to 1 if the first
two children are of the same sex (boy-boy or girl-girl) and equal to 0
otherwise. Are couples whose first two children are of the same sex more
likely to have a third child? Is the effect large? Is it statistically significant?

d. Explain why samesex is a valid instrument for the IV regression of
weeksworked on morekids.


These data were provided by Professor William Evans of the University of Maryland and were used in 
his paper with Joshua Angrist, “Children and Their Parents’ Labor Supply: Evidence from Exogenous 
Variation in Family Size,” American Economic Review, 1998, 88(3): 450-477. 
e. Is samesex a weak instrument?

f. Estimate the IV regression of weeksworked on morekids, using samesex 
as an instrument. How large is the fertility effect on labor supply? 

g. Do the results change when you include the variables agem1, black,
hispan, and othrace in the labor supply regression (treating these variables
as exogenous)? Explain why or why not.


Does viewing a violent movie lead to violent behavior? If so, the incidence of 
violent crimes, such as assaults, should rise following the release of a violent 
movie that attracts many viewers. Alternatively, movie viewing may substitute 
for other activities (such as alcohol consumption) that lead to violent behavior, 
so that assaults should fall when more viewers are attracted to the cinema. On 
the text website, http://www.pearsonglobaleditions.com, you will find the data 
file Movies, which contains data on the number of assaults and movie attendance 
for 516 weekends from 1995 through 2004.⁵ A detailed description is given in
Movies_Description, available on the website. The data set includes weekend U.S. 
attendance for strongly violent movies (such as Hannibal), mildly violent movies 
(such as Spider-Man), and nonviolent movies (such as Finding Nemo). The data set
also includes a count of the number of assaults for the same weekend in a subset 
of counties in the United States. Finally, the data set includes indicators for year, 
month, whether the weekend is a holiday, and various measures of the weather. 


a. i. Regress the logarithm of the number of assaults [ln_assaults =
ln(assaults)] on the year and month indicators. Is there evidence of
seasonality in assaults? That is, do there tend to be more assaults in 
some months than others? Explain. 

ii. Regress total movie attendance (attend = attend_v + attend_m + 
attend_n) on the year and month indicators. Is there evidence of 
seasonality in movie attendance? Explain. 


b. Regress ln_assaults on attend_v, attend_m, attend_n, the year and month
indicators, and the weather and holiday control variables available in the 
data set. 


i. Based on the regression, does viewing a strongly violent movie 
increase or decrease assaults? By how much? Is the estimated effect 
statistically significant? 

ii. Does attendance at strongly violent movies affect assaults 
differently than attendance at moderately violent movies? 
Differently than attendance at nonviolent movies? 


These are aggregated versions of data provided by Gordon Dahl of University of California-San Diego 
and Stefano DellaVigna of University of California—Berkeley and were used in their paper “Does Movie 
Violence Increase Violent Crime?” Quarterly Journal of Economics, 2009, 124(2): 677-734. 
iii. A strongly violent blockbuster movie is released, and the weekend’s
attendance at strongly violent movies increases by 6 million; meanwhile,
attendance falls by 2 million for moderately violent movies
and by 1 million for nonviolent movies. What is the predicted effect
on assaults? Construct a 95% confidence interval for the change
in assaults. [Hint: Review Section 7.3 and material surrounding
Equations (8.7) and (8.8).]


c. It is difficult to control for all the variables that affect assaults and that
might be correlated with movie attendance. For example, the effect of
the weather on assaults and movie attendance is only crudely approximated
by the weather variables in the data set. However, the data set
does include a set of instruments (pr_attend_v, pr_attend_m, and
pr_attend_n) that are correlated with attendance but are (arguably) uncorrelated
with weekend-specific factors (such as the weather) that affect
both assaults and movie attendance. These instruments use historical
attendance patterns, not information on a particular weekend, to predict a
film’s attendance in a given weekend. For example, if a film’s attendance
is high in the second week of its release, then this can be used to predict
that its attendance was also high in the first week of its release. (The
details of the construction of these instruments are available in the Dahl
and DellaVigna paper referenced in footnote 5.) Run the regression from
(b) (including year, month, holiday, and weather controls) but now using
pr_attend_v, pr_attend_m, and pr_attend_n as instruments for attend_v,
attend_m, and attend_n. Use this IV regression to answer (b)(i)-(b)(iii).


d. The intuition underlying the instruments in (c) is that attendance in
a given week is correlated with attendance in surrounding weeks. For
each movie category, the data set includes attendance in surrounding
weeks. Run the regression using the instruments attend_v_f, attend_m_f,
attend_n_f, attend_v_b, attend_m_b, and attend_n_b instead of the instruments
used in (c). Use this IV regression to answer (b)(i)-(b)(iii).

e. There are nine instruments listed in (c) and (d), but only three are
needed for identification. Carry out the test for overidentification summarized
in Key Concept 12.6. What do you conclude about the validity
of the instruments?

f. Based on your analysis, what do you conclude about the effect of violent
movies on (short-run) violent behavior?


(This requires Appendix 12.5) On the text website, http://www.pearsonglobaleditions.com,
you will find the data set WeakInstrument, which
contains 200 observations on $(Y_i, X_i, Z_i)$ for the instrumental regression
$Y_i = \beta_0 + \beta_1 X_i + u_i$.

a. Construct $\hat{\beta}_1^{TSLS}$, its standard error, and the usual 95% confidence
interval for $\beta_1$.
b. Compute the F-statistic for the regression of $X_i$ on $Z_i$. Is there evidence
of a “weak instrument” problem?


c. Compute a 95% confidence interval for $\beta_1$, using the Anderson-Rubin
procedure. (To implement the procedure, assume that $-5 \le \beta_1 \le 5$.)


d. Comment on the differences in the confidence intervals in (a) and (c). 
Which is more reliable? 


The Cigarette Consumption Panel Data Set 


The data set consists of annual data for the 48 contiguous U.S. states from 1985 to 1995. Quan- 
tity consumed is measured by annual per capita cigarette sales in packs per fiscal year, as 
derived from state tax collection data. The price is the real (that is, inflation-adjusted) average 
retail cigarette price per pack during the fiscal year, including taxes. Income is real per capita 
income. The general sales tax is the average tax, in cents per pack, due to the broad-based state 
sales tax applied to all consumption goods. The cigarette-specific tax is the tax applied to ciga- 
rettes only. All prices, income, and taxes used in the regressions in this chapter are deflated by 
the Consumer Price Index and thus are in constant (real) dollars. We are grateful to Professor
Jonathan Gruber of MIT for providing us with these data.


Derivation of the Formula for the TSLS 
Estimator in Equation (12.4) 


The first stage of TSLS is to regress $X_i$ on the instrument $Z_i$ by OLS and then compute the
OLS predicted value $\hat{X}_i$; the second stage is to regress $Y_i$ on $\hat{X}_i$ by OLS. Accordingly, the formula
for the TSLS estimator, expressed in terms of the predicted value $\hat{X}_i$, is the formula for
the OLS estimator in Key Concept 4.2, with $\hat{X}_i$ replacing $X_i$. That is, $\hat{\beta}_1^{TSLS} = s_{\hat{X}Y}/s_{\hat{X}}^2$, where $s_{\hat{X}}^2$
is the sample variance of $\hat{X}_i$ and $s_{\hat{X}Y}$ is the sample covariance between $Y_i$ and $\hat{X}_i$.

Because $\hat{X}_i$ is the predicted value of $X_i$ from the first-stage regression, $\hat{X}_i = \hat{\pi}_0 + \hat{\pi}_1 Z_i$,
the definitions of sample variances and covariances imply that $s_{\hat{X}Y} = \hat{\pi}_1 s_{ZY}$ and $s_{\hat{X}}^2 = \hat{\pi}_1^2 s_Z^2$
(Exercise 12.4). Thus, the TSLS estimator can be written as $\hat{\beta}_1^{TSLS} = s_{\hat{X}Y}/s_{\hat{X}}^2 = s_{ZY}/(\hat{\pi}_1 s_Z^2)$.
Finally, $\hat{\pi}_1$ is the OLS slope coefficient from the first stage of TSLS, so $\hat{\pi}_1 = s_{ZX}/s_Z^2$. Substitution
of this formula for $\hat{\pi}_1$ into the formula $\hat{\beta}_1^{TSLS} = s_{ZY}/(\hat{\pi}_1 s_Z^2)$ yields the formula for the TSLS
estimator in Equation (12.4).
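The algebra in this appendix can be confirmed numerically: running the two OLS stages explicitly gives the same number as the ratio of sample covariances $s_{ZY}/s_{ZX}$. The following is a minimal sketch on simulated data; the data-generating process and the function names are assumptions made for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def sample_cov(a, b):
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)


def ols(y, x):
    """Intercept and slope from an OLS regression of y on x."""
    slope = sample_cov(x, y) / sample_cov(x, x)
    return mean(y) - slope * mean(x), slope


def tsls_two_stage(y, x, z):
    """Stage 1: regress X on Z. Stage 2: regress Y on the fitted values."""
    pi0_hat, pi1_hat = ols(x, z)
    x_hat = [pi0_hat + pi1_hat * zi for zi in z]
    _, beta1_hat = ols(y, x_hat)
    return beta1_hat


def tsls_ratio(y, x, z):
    """Closed form: sample cov(Z, Y) divided by sample cov(Z, X)."""
    return sample_cov(z, y) / sample_cov(z, x)


# Hypothetical data in which X is endogenous because v enters the error of Y.
rng = random.Random(1)
z = [rng.gauss(0, 1) for _ in range(300)]
v = [rng.gauss(0, 1) for _ in range(300)]
x = [0.5 + 1.2 * zi + vi for zi, vi in zip(z, v)]
y = [1.0 + 2.0 * xi + 0.7 * vi + rng.gauss(0, 1) for xi, vi in zip(x, v)]

print(tsls_two_stage(y, x, z), tsls_ratio(y, x, z))
```

The two numbers agree up to floating-point error. The standard errors reported by the second-stage OLS regression are a separate matter and are not computed here.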
Large-Sample Distribution of the TSLS Estimator


This appendix studies the large-sample distribution of the TSLS estimator in the case consid- 
ered in Section 12.1 —that is, with a single instrument, a single included endogenous variable, 
and no included exogenous variables. 

To start, we derive a formula for the TSLS estimator in terms of the errors; this formula 
forms the basis for the remaining discussion, similar to the expression for the OLS estimator 
in Equation (4.28) in Appendix 4.3. 

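From Equation (12.1), $Y_i - \bar{Y} = \beta_1(X_i - \bar{X}) + (u_i - \bar{u})$. Accordingly, the sample covariance
between $Z$ and $Y$ can be expressed as

$$
\begin{aligned}
s_{ZY} &= \frac{1}{n-1}\sum_{i=1}^{n}(Z_i - \bar{Z})(Y_i - \bar{Y}) \\
&= \frac{1}{n-1}\sum_{i=1}^{n}(Z_i - \bar{Z})\left[\beta_1(X_i - \bar{X}) + (u_i - \bar{u})\right] \\
&= \beta_1 s_{ZX} + \frac{1}{n-1}\sum_{i=1}^{n}(Z_i - \bar{Z})(u_i - \bar{u}) \\
&= \beta_1 s_{ZX} + \frac{1}{n-1}\sum_{i=1}^{n}(Z_i - \bar{Z})u_i, \qquad (12.19)
\end{aligned}
$$

where $s_{ZX} = [1/(n-1)]\sum_{i=1}^{n}(Z_i - \bar{Z})(X_i - \bar{X})$ and where the final equality follows
because $\sum_{i=1}^{n}(Z_i - \bar{Z}) = 0$. Substituting the definition of $s_{ZX}$ and the final expression in
Equation (12.19) into the definition of $\hat{\beta}_1^{TSLS}$ and multiplying the numerator and denominator
by $(n-1)/n$ yields

$$
\hat{\beta}_1^{TSLS} = \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})u_i}{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(X_i - \bar{X})}. \qquad (12.20)
$$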


Large-Sample Distribution of $\hat{\beta}_1^{TSLS}$ When the IV
Regression Assumptions in Key Concept 12.4 Hold


Equation (12.20) for the TSLS estimator is similar to Equation (4.28) in Appendix 4.3 for the 
OLS estimator except that Z rather than X appears in the numerator and that the denomina- 
tor is the covariance between Z and X rather than the variance of X. Because of these similari- 
ties and because Z is exogenous, the argument in Appendix 4.3 that the OLS estimator is 
normally distributed in large samples extends to $\hat{\beta}_1^{TSLS}$.

Specifically, when the sample is large, $\bar{Z} \cong \mu_Z$, so the numerator is approximately
$\bar{q} = \frac{1}{n}\sum_{i=1}^{n} q_i$, where $q_i = (Z_i - \mu_Z)u_i$. Because the instrument is exogenous, $E(q_i) = 0$.
By the IV regression assumptions in Key Concept 12.4, $q_i$ is i.i.d. with variance
$\sigma_q^2 = \mathrm{var}[(Z_i - \mu_Z)u_i]$. It follows that $\mathrm{var}(\bar{q}) = \sigma_{\bar{q}}^2 = \sigma_q^2/n$, and, by the central limit theorem,
$\bar{q}/\sigma_{\bar{q}}$ is, in large samples, distributed $N(0, 1)$.

Because the sample covariance is consistent for the population covariance,
$s_{ZX} \xrightarrow{p} \mathrm{cov}(Z_i, X_i)$, which, because the instrument is relevant, is nonzero. Thus, by Equation
(12.20), $\hat{\beta}_1^{TSLS} \cong \beta_1 + \bar{q}/\mathrm{cov}(Z_i, X_i)$, so in large samples $\hat{\beta}_1^{TSLS}$ is approximately distributed
$N(\beta_1, \sigma_{\hat{\beta}_1^{TSLS}}^2)$, where $\sigma_{\hat{\beta}_1^{TSLS}}^2 = \sigma_{\bar{q}}^2/[\mathrm{cov}(Z_i, X_i)]^2 = (1/n)\,\mathrm{var}[(Z_i - \mu_Z)u_i]/[\mathrm{cov}(Z_i, X_i)]^2$,
which is the expression given in Equation (12.8).
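In practice the variance in Equation (12.8) is estimated by replacing the population moments with sample moments and the errors with TSLS residuals. The sketch below shows one such sample analog for the case of one regressor and one instrument. The degrees-of-freedom choices, the simulated data, and the function name are assumptions made for illustration; this is not the exact formula used by any particular software package.

```python
import math
import random


def tsls_with_se(y, x, z):
    """TSLS slope and a large-sample standard error for one X and one Z."""
    n = len(y)
    y_bar, x_bar, z_bar = sum(y) / n, sum(x) / n, sum(z) / n
    s_zy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y)) / (n - 1)
    s_zx = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x)) / (n - 1)
    beta1 = s_zy / s_zx
    beta0 = y_bar - beta1 * x_bar
    # TSLS residuals use the actual X, not the first-stage fitted values.
    resid = [yi - beta0 - beta1 * xi for yi, xi in zip(y, x)]
    q = [(zi - z_bar) * ui for zi, ui in zip(z, resid)]
    q_bar = sum(q) / n
    var_q = sum((qi - q_bar) ** 2 for qi in q) / (n - 1)
    # Sample analog of (1/n) var[(Z - mu_Z) u] / cov(Z, X)^2.
    var_beta1 = (var_q / n) / s_zx ** 2
    return beta1, math.sqrt(var_beta1)


# Hypothetical data with a strong instrument and an endogenous regressor.
rng = random.Random(0)
n, true_beta1 = 1000, 2.0
z = [rng.gauss(0, 1) for _ in range(n)]
v = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + vi for zi, vi in zip(z, v)]
y = [1.0 + true_beta1 * xi + 0.5 * vi + rng.gauss(0, 1) for xi, vi in zip(x, v)]

beta1_hat, se = tsls_with_se(y, x, z)
print(beta1_hat, se, (beta1_hat - 1.96 * se, beta1_hat + 1.96 * se))
```

Because the formula is built from products $(Z_i - \bar{Z})\hat{u}_i$ rather than from a single error variance, it does not rely on homoskedasticity.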


Large-Sample Distribution of the TSLS 
Estimator When the Instrument Is Not Valid 


This appendix considers the large-sample distribution of the TSLS estimator in the setup of 
Section 12.1 (one X, one Z) when one or the other of the conditions for instrument validity 
fails. If the instrument relevance condition fails, the large-sample distribution of the TSLS 
estimator is not normal; in fact, its distribution is that of a ratio of two normal random variables.
If the instrument exogeneity condition fails, the TSLS estimator is inconsistent.


Large-Sample Distribution of $\hat{\beta}_1^{TSLS}$ When the
Instrument Is Weak


First consider the case that the instrument is irrelevant, so that $\mathrm{cov}(Z_i, X_i) = 0$. Then the argument
in Appendix 12.3 entails division by 0. To avoid this problem, we need to take a closer
look at the behavior of the term in the denominator of Equation (12.20) when the population 
covariance is 0. 

We start by rewriting Equation (12.20). Because of the consistency of the sample average,
in large samples $\bar{Z}$ is close to $\mu_Z$, and $\bar{X}$ is close to $\mu_X$. Thus the term in the denominator of
Equation (12.20) is approximately $\frac{1}{n}\sum_{i=1}^{n}(Z_i - \mu_Z)(X_i - \mu_X) = \frac{1}{n}\sum_{i=1}^{n} r_i = \bar{r}$, where
$r_i = (Z_i - \mu_Z)(X_i - \mu_X)$. Let $\sigma_r^2 = \mathrm{var}[(Z_i - \mu_Z)(X_i - \mu_X)]$, let $\sigma_{\bar{r}}^2 = \sigma_r^2/n$, and let $\bar{q}$, $\sigma_q^2$,
and $\sigma_{\bar{q}}^2$ be as defined in Appendix 12.3. Then Equation (12.20) implies that, in large samples,

$$
\hat{\beta}_1^{TSLS} \cong \beta_1 + \frac{\bar{q}}{\bar{r}} = \beta_1 + \left(\frac{\sigma_{\bar{q}}}{\sigma_{\bar{r}}}\right)\left(\frac{\bar{q}/\sigma_{\bar{q}}}{\bar{r}/\sigma_{\bar{r}}}\right) = \beta_1 + \left(\frac{\sigma_q}{\sigma_r}\right)\left(\frac{\bar{q}/\sigma_{\bar{q}}}{\bar{r}/\sigma_{\bar{r}}}\right). \qquad (12.21)
$$

If the instrument is irrelevant, then $E(r_i) = \mathrm{cov}(Z_i, X_i) = 0$. Thus $\bar{r}$ is the sample average of
the random variables $r_i$, $i = 1, \ldots, n$, which are i.i.d. (by the second least squares assumption),
have variance $\sigma_r^2 = \mathrm{var}[(Z_i - \mu_Z)(X_i - \mu_X)]$ (which is finite by the third IV regression
assumption), and have a mean of 0 (because the instruments are irrelevant). It follows that the
central limit theorem applies to $\bar{r}$; specifically, $\bar{r}/\sigma_{\bar{r}}$ is approximately distributed $N(0, 1)$. Therefore,
the final expression of Equation (12.21) implies that, in large samples, the distribution of
$\hat{\beta}_1^{TSLS} - \beta_1$ is the distribution of $aS$, where $a = \sigma_q/\sigma_r$ and $S$ is the ratio of two random variables,
each of which has a standard normal distribution (these two standard normal random
variables are correlated).
In other words, when the instrument is irrelevant, the central limit theorem applies to the
denominator as well as the numerator of the TSLS estimator, so in large samples the distribu- 
tion of the TSLS estimator is the distribution of the ratio of two normal random variables. 
Because X; and u; are correlated, these normal random variables are correlated, and the large- 
sample distribution of the TSLS estimator when the instrument is irrelevant is complicated. In 
fact, the large-sample distribution of the TSLS estimator with irrelevant instruments is cen- 
tered on the probability limit of the OLS estimator. Thus when the instrument is irrelevant, 
TSLS does not eliminate the bias in OLS and, moreover, has a nonnormal distribution, even 
in large samples. 

A weak instrument represents an intermediate case between an irrelevant instrument and
the normal distribution derived in Appendix 12.3. When the instrument is weak but not irrelevant,
the distribution of the TSLS estimator continues to be nonnormal, so the general lesson
here about the extreme case of an irrelevant instrument carries over to weak instruments.
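A small Monte Carlo experiment illustrates the claim that, with an irrelevant instrument, the TSLS estimator is centered on the probability limit of OLS rather than on $\beta_1$. The design below is an assumption made for illustration: $X_i = \pi_1 Z_i + v_i$, $u_i$ has correlation $\rho$ with $v_i$, and all variables are standard normal. Because a ratio of normal random variables has very heavy tails, the sketch reports the median of the estimates rather than the mean.

```python
import math
import random
import statistics


def tsls_slope(y, x, z):
    n = len(y)
    y_bar, x_bar, z_bar = sum(y) / n, sum(x) / n, sum(z) / n
    num = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    den = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x))
    return num / den


def simulate_estimates(pi1, n=100, reps=1000, beta1=1.0, rho=0.8, seed=0):
    """TSLS estimates across repeated samples for first-stage slope pi1."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho ** 2)
    estimates = []
    for _ in range(reps):
        z = [rng.gauss(0, 1) for _ in range(n)]
        v = [rng.gauss(0, 1) for _ in range(n)]
        u = [rho * vi + scale * rng.gauss(0, 1) for vi in v]
        x = [pi1 * zi + vi for zi, vi in zip(z, v)]
        y = [beta1 * xi + ui for xi, ui in zip(x, u)]
        estimates.append(tsls_slope(y, x, z))
    return estimates


BETA1, RHO = 1.0, 0.8
irrelevant = simulate_estimates(pi1=0.0, beta1=BETA1, rho=RHO)
strong = simulate_estimates(pi1=1.0, beta1=BETA1, rho=RHO)

# With pi1 = 0, X = v, so the OLS probability limit is beta1 + cov(v, u)/var(v).
ols_plim_irrelevant = BETA1 + RHO

print(statistics.median(irrelevant), ols_plim_irrelevant)
print(statistics.median(strong), BETA1)
```

In this design the median estimate with the irrelevant instrument sits close to the OLS probability limit, while the median with the strong instrument sits close to the true coefficient.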


Large-Sample Distribution of $\hat{\beta}_1^{TSLS}$ When the
Instrument Is Endogenous


The numerator in the final expression in Equation (12.20) converges in probability to 
cov(Z;, u;). If the instrument is exogenous, this is 0, and the TSLS estimator is consistent 
(assuming that the instrument is not weak). If, however, the instrument is not exogenous, then, 
if the instrument is not weak, $\hat{\beta}_1^{TSLS} \xrightarrow{p} \beta_1 + \mathrm{cov}(Z_i, u_i)/\mathrm{cov}(Z_i, X_i) \neq \beta_1$. That is, if the
instrument is not exogenous, the TSLS estimator is inconsistent.


Instrumental Variables Analysis with Weak 
Instruments 


This appendix discusses some methods for instrumental variables analysis in the presence of 
potentially weak instruments. The appendix focuses on the case of a single included endoge- 
nous regressor [Equations (12.13) and (12.14)]. 


Testing for Weak Instruments 


The rule of thumb in Key Concept 12.5 is that a first-stage F-statistic less than 10 indicates that the
instruments are weak. One motivation for this rule of thumb arises from an approximate expression
for the bias of the TSLS estimator. Let $\beta_1^{OLS}$ denote the probability limit of the OLS estimator
$\hat{\beta}_1$, and let $\beta_1^{OLS} - \beta_1$ denote the asymptotic bias of the OLS estimator (if the regressor is endogenous,
then $\hat{\beta}_1 \xrightarrow{p} \beta_1^{OLS} \neq \beta_1$). It is possible to show that, when there are many instruments, the
bias of the TSLS estimator is approximately $E(\hat{\beta}_1^{TSLS}) - \beta_1 \approx (\beta_1^{OLS} - \beta_1)/[E(F) - 1]$,
where $E(F)$ is the expectation of the first-stage F-statistic. If $E(F) = 10$, then the bias of TSLS
relative to the bias of OLS is approximately 1/9, or just over 10%, which is small enough to be
acceptable in many applications. Replacing $E(F) > 10$ with $F > 10$ yields the rule of thumb
in Key Concept 12.5.

The motivation in the previous paragraph involved an approximate formula for the bias
of the TSLS estimator when there are many instruments. In most applications, however, the
number of instruments, $m$, is small. Stock and Yogo (2005) provide a formal test for weak
instruments that avoids the approximation that $m$ is large. In the Stock-Yogo test, the null
hypothesis is that the instruments are weak, and the alternative hypothesis is that the instruments
are strong, where strong instruments are defined to be instruments for which the bias
of the TSLS estimator is at most 10% of the bias of the OLS estimator. The test entails comparing
the first-stage F-statistic (for technical reasons, the homoskedasticity-only version) to a
critical value that depends on the number of instruments. As it happens, for a test with a 5%
significance level, this critical value ranges between 9.08 and 11.52, so the rule of thumb of
comparing F to 10 is a good approximation to the Stock-Yogo test.
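The many-instrument approximation behind the rule of thumb is easy to tabulate. The sketch below simply evaluates $1/[E(F) - 1]$ for a few values of the expected first-stage F-statistic; the function name is hypothetical, and the values of $E(F)$ are chosen only for illustration.

```python
def approximate_relative_bias(expected_f):
    """Approximate bias of TSLS relative to the bias of OLS: 1 / (E(F) - 1)."""
    if expected_f <= 1.0:
        raise ValueError("the approximation requires E(F) > 1")
    return 1.0 / (expected_f - 1.0)


for expected_f in (2, 5, 10, 20, 50):
    share = approximate_relative_bias(expected_f)
    print(f"E(F) = {expected_f:>2}: TSLS bias is about {share:.1%} of the OLS bias")
```

The row for $E(F) = 10$ reproduces the figure of 1/9 quoted above, and the table shows how quickly the approximate relative bias grows as the expected F-statistic falls toward 1.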


Hypothesis Tests and Confidence Sets for $\beta$


If the instruments are weak, the TSLS estimator is biased and has a nonnormal distribution.
Thus the TSLS t-test of $\beta_1 = \beta_{1,0}$ is unreliable, as is the TSLS confidence interval for $\beta_1$. There
are, however, other tests of $\beta_1 = \beta_{1,0}$, along with confidence intervals based on those tests, that
are valid whether instruments are strong, weak, or even irrelevant. When there is a single
endogenous regressor, the preferred test is Moreira’s (2003) conditional likelihood ratio
(CLR) test. An older test, which works for any number of endogenous regressors, is based on
the Anderson-Rubin (1949) statistic. Because the Anderson-Rubin statistic is conceptually
less complicated, we describe it first.

The Anderson-Rubin test of $\beta_1 = \beta_{1,0}$ proceeds in two steps. In the first step, compute a
new variable, $Y_i^* = Y_i - \beta_{1,0}X_i$. In the second step, regress $Y_i^*$ against the included exogenous
regressors (W’s) and the instruments (Z’s). The Anderson-Rubin statistic is the F-statistic
testing the hypothesis that the coefficients on the Z’s are all 0. Under the null hypothesis that
$\beta_1 = \beta_{1,0}$, if the instruments satisfy the exogeneity condition (condition 2 in Key Concept
12.3), they will be uncorrelated with the error term in this regression, and the null hypothesis
will be rejected in 5% of all samples.

As discussed in Sections 3.3 and 7.4, a confidence set can be constructed as the set of
values of the parameters that are not rejected by a hypothesis test. Accordingly, the set of
values of $\beta_1$ that are not rejected by a 5% Anderson-Rubin test constitutes a 95% confidence
set for $\beta_1$. When the Anderson-Rubin F-statistic is computed using the homoskedasticity-only
formula, the Anderson-Rubin confidence set can be constructed by solving a quadratic equation
(see Empirical Exercise 12.3). The logic behind the Anderson-Rubin statistic never
assumes instrument relevance, and the Anderson-Rubin confidence set will have a coverage
probability of 95% in large samples, whether the instruments are strong, weak, or even
irrelevant.
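The two-step procedure can be turned into a confidence set by repeating the test over a grid of candidate values. The sketch below does this for the simplest case of one instrument and no included exogenous regressors. It uses a grid search in place of the quadratic-equation solution mentioned above, the homoskedasticity-only F-statistic, and the large-sample 5% critical value of 3.84; the simulated data and the function names are assumptions made for illustration.

```python
import random


def sample_cov(a, b):
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (n - 1)


def anderson_rubin_f(y, x, z, beta_null):
    """Homoskedasticity-only F for the slope in a regression of Y - beta_null*X on Z."""
    n = len(y)
    y_star = [yi - beta_null * xi for yi, xi in zip(y, x)]
    s_zy = sample_cov(z, y_star)
    r_squared = s_zy ** 2 / (sample_cov(z, z) * sample_cov(y_star, y_star))
    return (n - 2) * r_squared / (1.0 - r_squared)


def anderson_rubin_set(y, x, z, grid, critical_value=3.84):
    """Grid values of beta_1 that the 5% Anderson-Rubin test does not reject."""
    return [b for b in grid if anderson_rubin_f(y, x, z, b) <= critical_value]


# Hypothetical data with a reasonably strong instrument.
rng = random.Random(2)
n = 200
z = [rng.gauss(0, 1) for _ in range(n)]
v = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + vi for zi, vi in zip(z, v)]
y = [1.0 + 2.0 * xi + 0.5 * vi + rng.gauss(0, 1) for xi, vi in zip(x, v)]

grid = [-5.0 + 0.01 * j for j in range(1001)]  # candidate values from -5 to 5
accepted = anderson_rubin_set(y, x, z, grid)
beta_tsls = sample_cov(z, y) / sample_cov(z, x)
print(min(accepted), max(accepted), beta_tsls)
```

With one instrument the statistic is exactly zero at the TSLS estimate, so the estimate always lies inside the set. With a weak instrument the accepted values can cover the whole grid or form disjoint pieces, which is why the set is reported as a list rather than as two endpoints.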

The CLR statistic also tests the hypothesis that $\beta_1 = \beta_{1,0}$. Likelihood ratio statistics compare
the value of the likelihood (see Appendix 11.2) under the null hypothesis to its value
under the alternative and reject the null hypothesis if the likelihood under the alternative is sufficiently greater
than under the null. Familiar test statistics in this text, such as the homoskedasticity-only
F-statistic in multiple regression, can be derived as likelihood ratio statistics under the assumption
of homoskedastic normally distributed errors. Unlike any of the other tests discussed in
this text, however, the critical value of the CLR test depends on the data, specifically on a
statistic that measures the strength of the instruments. By using the right critical value, the
CLR test is valid whether instruments are strong, weak, or irrelevant. CLR confidence intervals
can be computed as the set of values of $\beta_1$ that are not rejected by the CLR test.

The CLR test is equivalent to the TSLS t-test when instruments are strong and has very 
good power when instruments are weak. With suitable software, the CLR test is easy to use. 
The disadvantage of the CLR test is that it does not generalize readily to more than one
endogenous regressor. In that case, the Anderson-Rubin test (and confidence set) is recommended;
however, when instruments are strong (so TSLS is valid) and the coefficients are
overidentified, the Anderson-Rubin test is inefficient in the sense that it is less powerful than
the TSLS t-test.


Estimation of $\beta$


If the instruments are irrelevant, then without further restrictions it is not possible to obtain
an unbiased estimator of $\beta_1$, even in large samples. With weak instruments, CLR or Anderson-Rubin
confidence intervals for the coefficients are preferable to point estimation.

The problems of estimation, testing, and confidence intervals in IV regression with weak
instruments constitute an area of ongoing research. To learn more about this topic, visit the
website for this text.


TSLS with Control Variables 


In Key Concept 12.4, the W variables are assumed to be exogenous. This appendix considers 
the case in which W is not exogenous but instead is a control variable included to make Z 
exogenous. The logic of control variables in TSLS parallels the logic in OLS: If a control vari- 
able effectively controls for an omitted factor, then the instrument is uncorrelated with the 
error term. Because the control variable is correlated with the error term, the coefficient on a 
control variable does not have a causal interpretation. The mathematics of control variables 
in TSLS also parallels the mathematics of control variables in OLS and entails relaxing the 
assumption that the error has conditional mean 0 given Z and W to be that the conditional 
mean of the error does not depend on Z. This appendix draws on Appendix 6.5 (OLS with 
control variables), which should be reviewed first. 


Consider the IV regression model in Equation (12.12) with a single X and a single W: 


$$
Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i. \qquad (12.22)
$$

We replace IV regression assumption 1 in Key Concept 12.4 [which states that $E(u_i | W_i) = 0$]
with the assumption that, conditional on $W_i$, the mean of $u_i$ does not depend on $Z_i$:

$$
E(u_i | W_i, Z_i) = E(u_i | W_i). \qquad (12.23)
$$

The next steps in the argument parallel those for regression with control variables in
Equations (6.23)-(6.25) in Appendix 6.5. Assume that $E(u_i | W_i)$ is linear in $W_i$, so
$E(u_i | W_i) = \gamma_0 + \gamma_1 W_i$, where $\gamma_0$ and $\gamma_1$ are coefficients. Then

$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i - E(u_i | W_i, Z_i) + E(u_i | W_i, Z_i) \\
&= \beta_0 + \beta_1 X_i + \beta_2 W_i + \varepsilon_i + \gamma_0 + \gamma_1 W_i, \qquad (12.24)
\end{aligned}
$$

where the first line adds and subtracts $E(u_i | W_i, Z_i)$ to the right-hand side of Equation (12.22),
and the second line defines $\varepsilon_i = u_i - E(u_i | W_i, Z_i)$ and uses the conditional mean independence
assumption plus linearity to write $E(u_i | W_i, Z_i) = E(u_i | W_i) = \gamma_0 + \gamma_1 W_i$. We thus
have that

$$
Y_i = \delta_0 + \beta_1 X_i + \delta_2 W_i + \varepsilon_i, \qquad (12.25)
$$

where $\delta_0 = \beta_0 + \gamma_0$ and $\delta_2 = \beta_2 + \gamma_1$. Now $E(\varepsilon_i | W_i, Z_i) = E[u_i - E(u_i | W_i, Z_i) | W_i, Z_i] =
E(u_i | W_i, Z_i) - E(u_i | W_i, Z_i) = 0$, which in turn implies $\mathrm{corr}(Z_i, \varepsilon_i) = 0$. Thus IV regression
assumption 1 and the instrument exogeneity requirement (condition 2 in Key Concept 12.3)
both hold for Equation (12.24) with error term $\varepsilon_i$. Thus, if IV regression assumption 1 is replaced
by conditional mean independence in Equation (12.23), the original IV regression assumptions
in Key Concept 12.4 apply to the modified regression in Equation (12.25).

Because the IV regression assumptions of Key Concept 12.4 hold for Equation (12.25), 
all the methods of inference (for both weak and strong instruments) discussed in this chapter 
apply to Equation (12.25). In particular, if the instruments are strong, the coefficients in Equa- 
tion (12.25) will be estimated consistently by TSLS and TSLS tests, and confidence intervals 
will be valid. 

Just as in OLS with control variables, in general the TSLS coefficient on the control variable
$W$ does not have a causal interpretation. TSLS consistently estimates $\delta_2$ in Equation
(12.25), but $\delta_2$ is the sum of $\beta_2$, the direct causal effect of $W$, and $\gamma_1$, which reflects the correlation
between $W$ and the omitted factors in $u_i$ for which $W$ controls.

In the cigarette consumption regressions in Table 12.1, it is tempting to interpret the coefficient
on the 10-year change in log income as the income elasticity of demand. If, however,
income growth is correlated with increases in education and if more education reduces smoking,
income growth would have its own causal effect ($\beta_2$, the income elasticity) plus an effect
arising from its correlation with education ($\gamma_1$). If the latter effect is negative ($\gamma_1 < 0$), the
income coefficients in Table 12.1 (which estimate $\delta_2 = \beta_2 + \gamma_1$) would underestimate the
income elasticity. As long as the conditional mean independence assumption in Equation
(12.23) holds, however, the TSLS estimator of the price elasticity is consistent, even if the
estimate of the income elasticity is not.
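The argument of this appendix can be illustrated with a simulation in which the instrument is exogenous only after conditioning on the control variable. In the sketch below, which is an illustration under assumed parameter values rather than an analysis of the cigarette data, $Z$ is correlated with $W$, and the conditional mean of the error $u$ given $W$ and $Z$ equals $\gamma_1 W$. The exactly identified TSLS estimator is computed by solving the two sample moment conditions for the slope coefficients; the function names are hypothetical.

```python
import random


def sample_cov(a, b):
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (n - 1)


def simulate(n=20000, beta1=2.0, beta2=1.0, gamma1=-0.5, seed=3):
    rng = random.Random(seed)
    y, x, w, z = [], [], [], []
    for _ in range(n):
        wi = rng.gauss(0, 1)
        zi = 0.7 * wi + rng.gauss(0, 1)               # Z is correlated with W
        v = rng.gauss(0, 1)
        ui = gamma1 * wi + 0.6 * v + rng.gauss(0, 1)  # E(u | W, Z) = gamma1 * W
        xi = zi + 0.5 * wi + v                        # X is endogenous through v
        y.append(0.5 + beta1 * xi + beta2 * wi + ui)
        x.append(xi)
        w.append(wi)
        z.append(zi)
    return y, x, w, z


def tsls_with_control(y, x, w, z):
    """Solve cov(Z, e) = 0 and cov(W, e) = 0, where e = Y - b1*X - d2*W."""
    s_zy, s_zx, s_zw = sample_cov(z, y), sample_cov(z, x), sample_cov(z, w)
    s_wy, s_wx, s_ww = sample_cov(w, y), sample_cov(w, x), sample_cov(w, w)
    beta1_hat = (s_zy * s_ww - s_wy * s_zw) / (s_zx * s_ww - s_wx * s_zw)
    delta2_hat = (s_wy - beta1_hat * s_wx) / s_ww
    return beta1_hat, delta2_hat


y, x, w, z = simulate()
beta1_hat, delta2_hat = tsls_with_control(y, x, w, z)
beta1_without_w = sample_cov(z, y) / sample_cov(z, x)  # IV that omits W
print(beta1_hat, delta2_hat, beta1_without_w)
```

With the assumed values, the coefficient on $X$ is close to 2, the coefficient on $W$ is close to $\beta_2 + \gamma_1 = 0.5$ rather than to $\beta_2 = 1$, and the IV estimator that omits $W$ is noticeably away from 2.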
3 Quasi-Experiments


In many fields, such as psychology and medicine, causal effects are commonly
estimated using experiments. Before being approved for widespread medical use, 
for example, a new drug must be subjected to experimental trials in which some 
patients are randomly selected to receive the drug while others are given a harmless 
ineffective substitute (a placebo); the drug is approved only if this randomized 
controlled experiment provides convincing statistical evidence that the drug is safe 
and effective. 

There are three reasons to study randomized controlled experiments in an 
econometrics course. First, an ideal randomized controlled experiment provides a 
conceptual benchmark against which to judge estimates of causal effects made with 
observational data. Second, the results of randomized controlled experiments, when 
conducted, can be very influential, so it is important to understand the limitations and 
threats to validity of actual experiments, as well as their strengths. Third, external 
circumstances sometimes produce what appears to be randomization; that is, because of 
external events, the treatment of some individual occurs “as if” it is random, possibly con- 
ditional on some control variables. This “as if” randomness produces a quasi-experiment 
or natural experiment, and many of the methods developed for analyzing randomized 
experiments can be applied (with some modifications) to quasi-experiments. 

This chapter examines experiments and quasi-experiments in economics. The 
statistical tools used in this chapter are multiple regression analysis, regression analysis 
of panel data, and instrumental variables (IV) regression. What distinguishes the 
discussion in this chapter is not the tools used but rather the type of data analyzed and 
the special opportunities and challenges posed when analyzing experiments and 
quasi-experiments. 

The methods developed in this chapter are often used for evaluating social or 
economic programs. Program evaluation is the field of study that concerns 
estimating the effect of a program, policy, or some other intervention or “treatment.” 
What is the effect on earnings of going through a job training program? What is the 
effect on employment of low-skilled workers of an increase in the minimum wage? 
What is the effect on college attendance of making low-cost student aid loans 
available to middle-class students? This chapter discusses how such programs or 
policies can be evaluated using experiments or quasi-experiments. 

We begin in Section 13.1 by elaborating on the discussions in Chapters 1, 3, and 4 
of the estimation of causal effects using randomized controlled experiments. In reality, 
actual experiments with human subjects encounter practical problems that constitute 
threats to their internal and external validity; these threats and some econometric 
tools for addressing them are discussed in Section 13.2. Section 13.3 analyzes an
important randomized controlled experiment in which elementary students were 
randomly assigned to different-sized classes in the state of Tennessee in the late 1980s. 
Section 13.4 turns to the estimation of causal effects using quasi-experiments. Threats 
to the validity of quasi-experiments are discussed in Section 13.5. One issue that arises in 
both experiments and quasi-experiments is that treatment effects can differ from one 
member of the population to the next, and the matter of interpreting the resulting esti- 
mates of causal effects when the population is heterogeneous is taken up in Section 13.6. 


13.1


Potential Outcomes, Causal Effects,
and Idealized Experiments


This section explains how the population mean of individual-level causal effects can 
be estimated using a randomized controlled experiment and how data from such an 
experiment can be analyzed using multiple regression analysis. 


Potential Outcomes and the Average Causal Effect 


Suppose that you are considering taking a drug for a medical condition, enrolling in 
a job training program, or doing an optional econometrics problem set. It is reasonable
to ask, What are the benefits of doing so (that is, of receiving the treatment) for me? You
can imagine two hypothetical situations, one in which you receive the treatment and 
one in which you do not. Under each hypothetical situation, there would be a mea- 
surable outcome (the progress of the medical condition, getting a job, your econo- 
metrics grade). The difference in these two potential outcomes would be the causal 
effect, for you, of the treatment. 

More generally, a potential outcome is the outcome for an individual under a 
potential treatment. The causal effect for that individual is the difference in the 
potential outcome if the treatment is received and the potential outcome if it is not. 
In general, the causal effect can differ from one individual to the next. For example, 
the effect of a drug could depend on your age, whether you smoke, or other health 
conditions. The problem is that there is no way to measure the causal effect for a 
single individual: Because the individual either receives the treatment or does not, 
one of the potential outcomes can be observed—but not both. 

Although the causal effect cannot be measured for a single individual, in many appli- 
cations it suffices to know the mean causal effect in a population. For example, a job 
training program evaluation might trade off the average expenditure per trainee against 
average trainee success in finding a job. The mean of the individual causal effects in the 
population under study is called the average causal effect or the average treatment effect. 

The average causal effect for a given population can be estimated, at least in 
theory, using an ideal randomized controlled experiment. To see how, first suppose 
that the subjects are selected at random from the population of interest. Because the
subjects are selected by simple random sampling, their potential outcomes, and thus
their causal effects, are drawn from the same distribution, so the expected value of 
the causal effect in the sample is the average causal effect in the population. Next 
suppose that subjects are randomly assigned to the treatment or the control group. 
Because an individual’s treatment status is randomly assigned, it is distributed inde- 
pendently of his or her potential outcomes. Thus the expected value of the outcome 
for those treated minus the expected value of the outcome for those not treated 
equals the expected value of the causal effect. Thus when the concept of potential 
outcomes is combined with (1) random selection of individuals from a population 
and (2) random experimental assignment of treatment to those individuals, the 
expected value of the difference in outcomes between the treatment and control 
groups is the average causal effect in the population. That is, as was stated in
Section 3.5, the average causal effect on $Y_i$ of treatment ($X_i = 1$) versus no
treatment ($X_i = 0$) is the difference in the conditional expectations,
$E(Y_i \mid X_i = 1) - E(Y_i \mid X_i = 0)$, where $E(Y_i \mid X_i = 1)$ and $E(Y_i \mid X_i = 0)$ are,
respectively, the expected values of $Y$ for the treatment and control groups in an ideal
randomized controlled experiment. Appendix 13.3 provides a mathematical treatment
of the foregoing reasoning.
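
The reasoning above can be checked numerically. The following is a minimal simulation sketch, not part of the original analysis: the outcome distribution, the mean effect of 5, and the sample size are hypothetical values chosen only for illustration. Because the simulation generates both potential outcomes, the sample average of the individual causal effects can be compared with the difference in observed group means, which uses only one potential outcome per subject.

```python
import random

random.seed(0)

def simulate_experiment(n=100_000):
    """Simulate potential outcomes and an ideal randomized experiment (hypothetical numbers)."""
    effects, treated_y, control_y = [], [], []
    for _ in range(n):
        y0 = random.gauss(50.0, 10.0)      # potential outcome without treatment
        effect = random.gauss(5.0, 3.0)    # individual causal effect, differs across subjects
        y1 = y0 + effect                   # potential outcome with treatment
        effects.append(effect)
        if random.random() < 0.5:          # random assignment, independent of (y0, y1)
            treated_y.append(y1)           # only y1 is observed for this subject
        else:
            control_y.append(y0)           # only y0 is observed for this subject
    average_effect = sum(effects) / n
    diff_in_means = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
    return average_effect, diff_in_means

average_causal_effect, difference_in_means = simulate_experiment()
print(f"average of individual causal effects: {average_causal_effect:.3f}")
print(f"difference in observed group means:   {difference_in_means:.3f}")
```

In repeated runs with different seeds, the two printed numbers should differ only by sampling error, even though no subject's individual effect is ever observed by the experimenter.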

In general, an individual causal effect can be thought of as depending both on 
observable variables and on unobservable variables. We have already encountered 
the idea that a causal effect can depend on observable variables; for example, 
Chapter 8 examined the possibility that the effect of a class size reduction might 
depend on whether a student is an English learner. Through Section 13.5, we consider 
the case that variation in causal effects depends only on observable variables. 
Section 13.6 takes up the case that causal effects depend on unobserved variables. 


Econometric Methods for Analyzing Experimental Data 


Data from a randomized controlled experiment can be analyzed by comparing dif- 
ferences in means or by a regression that includes the treatment indicator and addi- 
tional control variables. This latter specification, the differences estimator with 
additional regressors, can also be used in more complicated randomization schemes, 
in which the randomization probabilities depend on observable covariates. 


The differences estimator. The differences estimator is the difference in the sample
averages for the treatment and control groups (Section 3.5), which can be computed
by regressing the outcome variable $Y$ on a binary treatment indicator $X$:


$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \ldots, n.$ (13.1)


As discussed in Section 4.4, if $X$ is randomly assigned, then $E(u_i \mid X_i) = 0$, and the
OLS estimator of the causal effect $\beta_1$ in Equation (13.1) is an unbiased and consistent
estimator of the causal effect.
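
The equivalence between the OLS slope in Equation (13.1) and the difference in group means can be verified on a small data set. The sketch below uses eight made-up observations (the numbers are hypothetical) and the textbook formulas for the OLS intercept and slope with a single regressor.

```python
def ols_simple(x, y):
    """OLS intercept and slope for a regression of y on a constant and x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical data: x is the binary treatment indicator, y is the outcome.
x = [1, 0, 1, 0, 1, 0, 0, 1]
y = [12.0, 9.0, 15.0, 10.0, 11.0, 8.0, 13.0, 14.0]

treated = [yi for xi, yi in zip(x, y) if xi == 1]
control = [yi for xi, yi in zip(x, y) if xi == 0]
difference_in_means = sum(treated) / len(treated) - sum(control) / len(control)

intercept, slope = ols_simple(x, y)
print(f"control mean = {sum(control) / len(control):.2f}, OLS intercept = {intercept:.2f}")
print(f"difference in means = {difference_in_means:.2f}, OLS slope = {slope:.2f}")
```

The intercept reproduces the control-group mean and the slope reproduces the difference in means, which is why the regression form is a convenient way to compute the differences estimator and its standard error.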


The differences estimator with additional regressors. The efficiency of the differences
estimator often can be improved by including some control variables $W$ in the
regression; doing so leads to the differences estimator with additional regressors:


$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \cdots + \beta_{1+r} W_{ri} + u_i, \quad i = 1, \ldots, n.$ (13.2)


If $W$ helps to explain the variation in $Y$, then including $W$ reduces the standard error
of the regression and, typically, the standard error of $\hat{\beta}_1$. As discussed in Section 7.5
and Appendix 6.5, for the estimator $\hat{\beta}_1$ of the causal effect $\beta_1$ in Equation (13.2) to
be unbiased, the control variables $W$ must be such that $u_i$ satisfies conditional mean
independence; that is, $E(u_i \mid X_i, W_i) = E(u_i \mid W_i)$. This condition is satisfied if $W_i$ are
pretreatment individual characteristics, such as sex: If $W_i$ is a pretreatment characteristic
and $X_i$ is randomly assigned, then $X_i$ is independent of $u_i$ and $W_i$, so
$E(u_i \mid X_i, W_i) = E(u_i \mid W_i)$. The $W$ regressors in Equation (13.2) should not include
experimental outcomes ($X_i$ is not randomly assigned, given an experimental outcome).
As usual with control variables under conditional mean independence, the
coefficients on the control variables do not have a causal interpretation.
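
The efficiency gain from adding a pretreatment regressor can be illustrated with simulated data. The sketch below is an illustration under stated assumptions, not the procedure used for any result in this chapter: the data-generating process and all coefficient values are hypothetical, the helper functions `invert` and `ols` are written here for the example, and homoskedasticity-only standard errors are used only because the simulated errors are homoskedastic by construction.

```python
import random

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def ols(rows, y):
    """OLS coefficients and homoskedasticity-only standard errors.

    Each row of `rows` must already contain the constant regressor 1.0.
    """
    n, k = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    xtx_inv = invert(xtx)
    beta = [sum(xtx_inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    s2 = sum(e * e for e in resid) / (n - k)       # squared standard error of the regression
    se = [(s2 * xtx_inv[i][i]) ** 0.5 for i in range(k)]
    return beta, se

random.seed(1)
n = 2000
x = [1.0 if random.random() < 0.5 else 0.0 for _ in range(n)]   # randomly assigned treatment
w = [random.gauss(0.0, 1.0) for _ in range(n)]                  # pretreatment characteristic
y = [10.0 + 2.0 * xi + 4.0 * wi + random.gauss(0.0, 1.0) for xi, wi in zip(x, w)]

beta_13_1, se_13_1 = ols([[1.0, xi] for xi in x], y)                   # Equation (13.1)
beta_13_2, se_13_2 = ols([[1.0, xi, wi] for xi, wi in zip(x, w)], y)   # Equation (13.2)
print(f"without W: estimate {beta_13_1[1]:.3f}, standard error {se_13_1[1]:.3f}")
print(f"with W:    estimate {beta_13_2[1]:.3f}, standard error {se_13_2[1]:.3f}")
```

Both regressions estimate the same causal effect because treatment is randomly assigned, but the second absorbs the variation in the outcome explained by the pretreatment characteristic and so should report a noticeably smaller standard error.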


Estimating causal effects that depend on observables. As discussed in Chapter 8,
variation in causal effects that depends on observables can be estimated by including
suitable nonlinear functions of, or interactions with, $X_i$. For example, if $W_{1i}$ is a binary
indicator denoting sex, then distinct causal effects for men and women can be estimated
by including the interaction variable $W_{1i} \times X_i$ in the regression in Equation (13.2).


Randomization based on covariates. Randomization in which the probability of
assignment to the treatment group depends on one or more observable variables $W$
is called randomization based on covariates. If randomization is based on covariates,
then in general the differences estimator based on Equation (13.1) suffers from omitted
variable bias. For example, consider a hypothetical experiment to estimate the
causal effect of mandatory versus optional homework in an econometrics course.
Suppose that there is random assignment, but economics majors ($W_i = 1$) are
assigned to the treatment group (mandatory homework, $X_i = 1$) with higher probability
than nonmajors ($W_i = 0$). If majors tend to do better in the course than
nonmajors anyway, then there is omitted variable bias because being in the treatment
group is correlated with the omitted variable, being a major.

Because $X_i$ is randomly assigned given $W_i$, this omitted variable bias can be eliminated
by using the differences estimator with the additional control variable $W_i$. The
random assignment of $X_i$ given $W_i$ implies that, given $W_i$, the mean of $u_i$ does not
depend on $X_i$; that is, $E(u_i \mid X_i, W_i) = E(u_i \mid W_i)$. Thus if the treatment effect is the same
for majors and nonmajors, the first least squares assumption for causal inference with
control variables (Key Concept 6.6) is satisfied, and the OLS estimator $\hat{\beta}_1$ in Equation
(13.2) is an unbiased estimator of the causal effect when $X_i$ is assigned randomly based
on $W_i$. If the treatment effect is different for majors and nonmajors, then the interaction
term $X_i \times W_i$ needs to be added to Equation (13.2), and with this addition, the first
least squares assumption for causal inference with control variables is satisfied.
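
The homework example can be made concrete with a short simulation. This is a sketch under hypothetical assumptions: the assignment probabilities (0.8 for majors, 0.2 for nonmajors), the true effect of 5 points, and the 10-point advantage of majors are invented for illustration. Because $W_i$ is a single binary variable, the coefficient on $X_i$ in Equation (13.2) is computed here by subtracting the $W$-group means from both $Y_i$ and $X_i$ and then regressing one on the other (the Frisch-Waugh result), which avoids the need for a matrix library.

```python
import random

random.seed(2)

def mean(values):
    return sum(values) / len(values)

n = 20_000
data = []
for _ in range(n):
    major = 1 if random.random() < 0.5 else 0             # W_i
    p_treat = 0.8 if major else 0.2                        # assignment probability depends on W_i
    mandatory = 1 if random.random() < p_treat else 0      # X_i, random given W_i
    score = 70.0 + 5.0 * mandatory + 10.0 * major + random.gauss(0.0, 5.0)
    data.append((score, mandatory, major))

# Differences estimator, Equation (13.1): omits W and picks up the effect of being a major.
naive = mean([y for y, x, w in data if x == 1]) - mean([y for y, x, w in data if x == 0])

# Differences estimator with W as a control, Equation (13.2), via within-group demeaning.
y_bar = {g: mean([y for y, x, w in data if w == g]) for g in (0, 1)}
x_bar = {g: mean([x for y, x, w in data if w == g]) for g in (0, 1)}
numerator = sum((x - x_bar[w]) * (y - y_bar[w]) for y, x, w in data)
denominator = sum((x - x_bar[w]) ** 2 for y, x, w in data)
controlled = numerator / denominator

print(f"true effect: 5.0, estimate without W: {naive:.2f}, estimate controlling for W: {controlled:.2f}")
```

Under these assumptions the estimate that omits $W_i$ should be roughly 11 rather than 5, because treated students are disproportionately majors, while the estimate that controls for $W_i$ should be close to 5.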


13.2 


Threats to Validity of Experiments 


Recall from Key Concept 9.1 that a statistical study is internally valid if the statistical 
inferences about causal effects are valid for the population being studied; it is exter- 
nally valid if its inferences and conclusions can be generalized from the population 
and setting studied to other populations and settings. Various real-world problems 
pose threats to the internal and external validity of the statistical analysis of actual 
experiments with human subjects. 


Threats to Internal Validity 


Threats to the internal validity of randomized controlled experiments include failure 
to randomize, failure to follow the treatment protocol, attrition, experimental effects, 
and small sample sizes. 


Failure to randomize. If the treatment is not assigned randomly but instead is based 
in part on the characteristics or preferences of the subject, then experimental out- 
comes will reflect both the effect of the treatment and the effect of the nonrandom 
assignment. For example, suppose that participants in a job training program experi- 
ment are assigned to the treatment group depending on whether their last name falls 
in the first or second half of the alphabet. Because of ethnic differences in last names, 
ethnicity could differ systematically between the treatment and control groups. To 
the extent that work experience, education, and other labor market characteristics 
differ by ethnicity, there could be systematic differences between the treatment and 
control groups in these omitted factors that affect outcomes. In general, nonrandom 
assignment can lead to correlation between $X_i$ and $u_i$ in Equations (13.1) and (13.2),
which in turn leads to bias in the estimator of the treatment effect.

It is possible to test for randomization. If treatment is randomly received, then
$X_i$ will be uncorrelated with observable pretreatment individual characteristics $W$.
Thus a test for random receipt of treatment entails testing the hypothesis that the
coefficients on $W_{1i}, \ldots, W_{ri}$ are 0 in a regression of $X_i$ on $W_{1i}, \ldots, W_{ri}$. In the job
training program example, regressing receipt of job training ($X_i$) on sex, race, and prior
education ($W$’s) and then computing the F-statistic testing whether the coefficients
on the $W$’s are 0 provides a test of the null hypothesis that treatment was randomly
received against the alternative hypothesis that receipt of treatment depended on
sex, race, or prior education. If the experimental design performs randomization
conditional on covariates, then those covariates would be included in the regression,
and the F-test would test the coefficients on the remaining $W$’s.¹


¹In this example, $X_i$ is binary, so, as discussed in Chapter 11, the regression of $X_i$ on $W_{1i}, \ldots, W_{ri}$ is a linear
probability model, and heteroskedasticity-robust standard errors are essential. Another way to test the
hypothesis that $E(X_i \mid W_{1i}, \ldots, W_{ri})$ does not depend on $W_{1i}, \ldots, W_{ri}$ when $X_i$ is binary is to use a probit
or logit model (see Section 11.2).
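
A simplified version of this randomization check can be sketched in a few lines. The example below makes several assumptions that go beyond the text: it uses a single binary pretreatment characteristic (so the F-statistic reduces to the square of one t-statistic), the data are simulated, and the 0.6 versus 0.4 treatment probabilities in the nonrandom case are hypothetical. The standard error follows the heteroskedasticity-robust formula for a regression with one regressor, in line with the footnote's point that robust standard errors are essential in a linear probability model.

```python
import random

def robust_t_stat(w, x):
    """t-statistic on W in a regression of X on a constant and W,
    using the heteroskedasticity-robust variance for a single regressor."""
    n = len(w)
    w_bar, x_bar = sum(w) / n, sum(x) / n
    s_ww = sum((wi - w_bar) ** 2 for wi in w)
    slope = sum((wi - w_bar) * (xi - x_bar) for wi, xi in zip(w, x)) / s_ww
    intercept = x_bar - slope * w_bar
    resid = [xi - intercept - slope * wi for wi, xi in zip(w, x)]
    meat = sum(((wi - w_bar) ** 2) * (e ** 2) for wi, e in zip(w, resid)) / (n - 2)
    variance = (meat / n) / (s_ww / n) ** 2
    return slope / variance ** 0.5

random.seed(3)
n = 5000
female = [1.0 if random.random() < 0.5 else 0.0 for _ in range(n)]        # W_i
# Case 1: treatment received does not depend on W_i.
randomized = [1.0 if random.random() < 0.5 else 0.0 for _ in range(n)]
# Case 2: treatment received is more likely when W_i = 1 (hypothetical failure to randomize).
selected = [1.0 if random.random() < (0.6 if wi else 0.4) else 0.0 for wi in female]

t_randomized = robust_t_stat(female, randomized)
t_selected = robust_t_stat(female, selected)
print(f"t-statistic, random receipt of treatment:    {t_randomized:.2f}")
print(f"t-statistic, nonrandom receipt of treatment: {t_selected:.2f}")
```

With several pretreatment characteristics, the same idea would be carried out with a heteroskedasticity-robust F-statistic on all of the $W$ coefficients jointly, as described in the text.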


Failure to follow the treatment protocol. In an actual experiment, people do not
always do what they are told. In a job training program experiment, for example, 
some of the subjects assigned to the treatment group might not show up for the train- 
ing sessions and thus not receive the treatment. Similarly, subjects assigned to the 
control group might somehow receive the training anyway, perhaps by making a 
special request to an instructor or administrator. 

The failure of individuals to follow completely the randomized treatment protocol 
is called partial compliance with the treatment protocol. Suppose that the experimenter 
knows whether the treatment was actually received (for example, whether the trainee 
attended class), and the treatment actually received is recorded as $X_i$. With partial
compliance, there is an element of choice in whether the subject receives the treatment,
so $X_i$ can be correlated with $u_i$ even if initially there is random assignment. Thus failure
to follow the treatment protocol leads to bias in the OLS estimator.

If there are data on both treatment actually received ($X_i$) and the initial random
assignment, then the treatment effect can be estimated by instrumental variables
regression. Instrumental variables estimation of the treatment effect entails the estimation
of Equation (13.1), or Equation (13.2) if there are control variables, using
the initial random assignment ($Z_i$) as an instrument for the treatment actually
received ($X_i$). Recall that a variable must satisfy the two conditions of instrument
relevance and instrument exogeneity (Key Concept 12.3) to be a valid instrumental
variable. As long as the protocol is partially followed, the actual treatment level
is partially determined by the assigned treatment level, so the instrumental variable
$Z_i$ is relevant. If initial assignment is random, then $Z_i$ is distributed independently of
$u_i$ (conditional on $W_i$ if randomization is conditional on covariates), so the instrument
is exogenous. Thus in an experiment with randomly assigned treatment, partial
compliance, and data on actual treatment, the original random assignment is a valid
instrumental variable.
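
The instrumental variables argument can be illustrated with a simulated job training experiment. The sketch below rests on hypothetical assumptions: a constant treatment effect of 2, an unobserved "ability" term that drives both compliance and outcomes, and particular compliance thresholds. With a single binary instrument and no control variables, the IV (TSLS) estimator reduces to a ratio of two differences in means, which is what the code computes.

```python
import random

random.seed(4)

def mean(values):
    return sum(values) / len(values)

n = 50_000
rows = []
for _ in range(n):
    z = 1 if random.random() < 0.5 else 0                 # Z_i: initial random assignment
    ability = random.gauss(0.0, 1.0)                       # unobserved, part of u_i
    if z == 1:
        x = 1 if ability > -0.5 else 0                     # some assignees never show up
    else:
        x = 1 if ability > 1.5 else 0                      # a few controls obtain training anyway
    y = 2.0 * x + 3.0 * ability + random.gauss(0.0, 1.0)   # true treatment effect is 2
    rows.append((y, x, z))

# OLS on treatment actually received: biased because X_i is correlated with ability.
ols_estimate = mean([y for y, x, z in rows if x == 1]) - mean([y for y, x, z in rows if x == 0])

# IV using initial assignment: difference in mean outcomes by assignment group divided
# by the difference in the share actually treated by assignment group.
y_gap = mean([y for y, x, z in rows if z == 1]) - mean([y for y, x, z in rows if z == 0])
x_gap = mean([x for y, x, z in rows if z == 1]) - mean([x for y, x, z in rows if z == 0])
iv_estimate = y_gap / x_gap

print(f"true effect: 2.0, OLS: {ols_estimate:.2f}, IV: {iv_estimate:.2f}")
print(f"difference in treatment rates by assignment (instrument relevance): {x_gap:.2f}")
```

In this setup OLS overstates the effect because the more able subjects are the ones who end up treated, while the IV estimate should be close to the true value. The denominator `x_gap` also shows instrument relevance directly: if assignment barely changed who was treated, the ratio would be very imprecise.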


Attrition. Attrition refers to subjects dropping out of the study after being randomly 
assigned to the treatment or the control group. Sometimes attrition occurs for rea- 
sons unrelated to the treatment program; for example, a participant in a job training 
study might need to leave town to care for a sick relative. But if the reason for attri- 
tion is related to the treatment itself, then the attrition can result in bias in the OLS 
estimator of the causal effect. For example, suppose that the most able trainees drop 
out of the job training program experiment because they get out-of-town jobs 
acquired using the job training skills, so at the end of the experiment only the least 
able members of the treatment group remain. Then the distribution of unmeasured 
characteristics (ability) will differ between the control and treatment groups (the 
treatment enabled the ablest trainees to leave town). In other words, the treatment 
$X_i$ will be correlated with $u_i$ (which includes ability) for those who remain in the
sample at the end of the experiment, so the differences estimator will be biased. 
Because attrition results in a nonrandomly selected sample, attrition that is related 
to the treatment leads to selection bias (Key Concept 9.4). 
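
The attrition example can also be simulated. In the sketch below, all numbers are hypothetical: the true treatment effect is 1, outcomes also depend on unobserved ability, and treated subjects with ability above 1 leave the sample. The differences estimator is computed once on the full randomized sample and once on the subjects who remain.

```python
import random

random.seed(5)

def mean(values):
    return sum(values) / len(values)

def differences_estimator(sample):
    """Difference in mean outcomes between treated and control subjects."""
    treated = [y for y, x in sample if x == 1]
    control = [y for y, x in sample if x == 0]
    return mean(treated) - mean(control)

n = 50_000
full_sample, remaining = [], []
for _ in range(n):
    x = 1 if random.random() < 0.5 else 0                  # random assignment
    ability = random.gauss(0.0, 1.0)                        # unobserved, part of u_i
    y = 1.0 * x + 2.0 * ability + random.gauss(0.0, 1.0)    # true treatment effect is 1
    full_sample.append((y, x))
    leaves_study = (x == 1 and ability > 1.0)               # ablest trainees take out-of-town jobs
    if not leaves_study:
        remaining.append((y, x))

estimate_full = differences_estimator(full_sample)
estimate_after_attrition = differences_estimator(remaining)
print(f"no attrition:   {estimate_full:.2f}")
print(f"with attrition: {estimate_after_attrition:.2f}")
```

Because only the less able members of the treatment group remain, the estimate based on the remaining sample should fall well below the true effect, even though the original assignment was random.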


The Hawthorne Effect


During the 1920s and 1930s, the General Electric Company conducted a series of
studies of worker productivity at its Hawthorne plant. In one set of experiments, the
researchers varied lightbulb wattage to see how lighting affected the productivity of
women assembling electrical parts. In other experiments, they increased or decreased
rest periods, changed the workroom layout, and shortened workdays. Influential early
reports on these studies concluded that productivity continued to rise whether the lights
were dimmer or brighter, whether workdays were longer or shorter, or whether conditions
improved or worsened. Researchers concluded that the productivity improvements were
not the consequence of changes in the workplace but instead came about because their
special role in the experiment made the workers feel noticed and valued, so they
worked harder and harder. Over the years, the idea that being in an experiment
influences subject behavior has come to be known as the Hawthorne effect.

But there is a glitch to this story: Careful examination of the actual Hawthorne data
reveals no Hawthorne effect (Gillespie, 1991; Jones, 1992)! Still, in some experiments,
especially ones in which the subjects have a stake in the outcome, merely being in an
experiment could affect behavior. The Hawthorne effect and experimental effects more
generally can pose threats to internal validity, even though the Hawthorne effect is not
evident in the original Hawthorne data.


Experimental effects. In experiments with human subjects, the mere fact that the subjects
are in an experiment can change their behavior, a phenomenon sometimes
called the Hawthorne effect (see the box “The Hawthorne Effect”).

In some experiments, a “double-blind” protocol can mitigate the effect of being 
in an experiment: Although subjects and experimenters both know that they are in 
an experiment, neither knows whether a subject is in the treatment group or the 
control group. In a medical drug experiment, for example, sometimes the drug and 
the placebo can be made to look the same so that neither the medical professional 
dispensing the drug nor the patient knows whether the administered drug is the real 
thing or the placebo. If the experiment is double-blind, then both the treatment and 
control groups should experience the same experimental effects, so different out- 
comes between the two groups can be attributed to the drug. 

Double-blind experiments are often infeasible in real-world experiments in eco- 
nomics: Both the experimental subject and the instructor know whether the subject 
is attending the job training program. In a poorly designed experiment, this experi- 
mental effect could be substantial. For example, teachers in an experimental program 
might try especially hard to make the program a success if they think their future 
employment depends on the outcome of the experiment. Deciding whether experi- 
mental results are biased because of the experimental effects requires making judg- 
ments based on details of how the experiment was conducted. 


Small sample sizes. Because experiments with human subjects can be expensive,
sometimes the sample size is small. A small sample size does not bias estimators of 
the causal effect, but it does mean that the causal effect is estimated imprecisely. A 
small sample also raises threats to the validity of confidence intervals and hypothesis 
tests. Because inference based on normal critical values and heteroskedasticity- 
robust standard errors is justified using large-sample approximations, experimental 
data with small samples are sometimes analyzed under the assumption that the 
errors are normally distributed (Sections 3.6 and 5.6); however, the assumption of 
normality is typically as dubious for experimental data as it is for observational data. 


Threats to External Validity 


Threats to external validity compromise the ability to generalize the results of the 
study to other populations and settings. 


Nonrepresentative sample. The population studied and the population of interest 
must be sufficiently similar to justify generalizing the experimental results. If a job 
training program is evaluated in an experiment with former prison inmates, then it 
might be possible to generalize the study results to other former prison inmates. 
Because a criminal record weighs heavily on the minds of potential employers, how- 
ever, the results might not generalize to workers who have never committed a crime. 


Nonrepresentative program or policy. The policy or program of interest must be 
sufficiently similar to the program studied to permit generalizing the results. A pro- 
gram studied in a small-scale, tightly monitored experiment could be quite different 
from the program actually implemented. If the program actually implemented is 
widely available, then the scaled-up program might not provide the same quality 
control as the experimental version or might be funded at a lower level; either pos- 
sibility could result in the full-scale program being less effective than the smaller 
experimental program. Another difference between an experimental program and 
an actual program might be its duration: The experimental program lasts only for the 
length of the experiment, whereas the actual program under consideration might be 
available for longer periods of time. 


General equilibrium effects. An issue related to scale and duration concerns what 
economists call general equilibrium effects. Turning a small, temporary experimental 
program into a widespread, permanent program might change the economic environ- 
ment sufficiently that the results from the experiment cannot be generalized. A small, 
experimental job training program, for example, might supplement training by 
employers, but if the program were made widely available, it could displace employer- 
provided training, thereby reducing the net benefits of the program. An internally 
valid small experiment might correctly measure a causal effect, holding constant the 
market or policy environment, but general equilibrium effects mean that these other 
factors are not, in fact, held constant when the program is implemented broadly.


13.3


Experimental Estimates of the Effect 
of Class Size Reductions 


In this section, we return to a question addressed in Part I: What is the effect on test 
scores of reducing class size in the early grades? In the late 1980s, Tennessee con- 
ducted a large, multimillion-dollar randomized controlled experiment to ascertain 
whether class size reduction was an effective way to improve elementary education. 
The results of this experiment have strongly influenced our understanding of the 
effect of class size reductions. 


Experimental Design 


The Tennessee class size reduction experiment, known as Project STAR (Student–Teacher
Achievement Ratio), was a 4-year experiment designed to evaluate the effect
on learning of small class sizes. Funded by the Tennessee state legislature, the experi- 
ment cost approximately $12 million. The study compared three different class arrange- 
ments for kindergarten through third grade: a regular-sized class, with 22 to 25 students 
per class, a single teacher, and no teacher’s aide; a small class, with 13 to 17 students per 
class and no teacher’s aide; and a regular-sized class with a teacher’s aide. 

Each school participating in the experiment had at least one class of each type, 
and students entering kindergarten in a participating school were randomly assigned 
to one of these three groups at the beginning of the 1985-1986 academic year. Teach- 
ers were also assigned randomly to one of the three types of classes. 

According to the original experimental protocol, students would stay in their ini- 
tially assigned class type for the 4 years of the experiment (kindergarten through third 
grade). However, because of parent complaints, students initially assigned to a regular 
class (with or without an aide) were randomly reassigned at the beginning of first grade 
to a regular class with an aide or to a regular class without an aide; students initially 
assigned to a small class remained in a small class. Students entering school in first 
grade (kindergarten was optional), in the second year of the experiment, were ran- 
domly assigned to one of the three groups. Each year students in the experiment were 
given standardized tests (the Stanford Achievement Test) in reading and math. 

The project paid for the additional teachers and aides necessary to achieve the 
target class sizes. During the first year of the study, approximately 6400 students 
participated in 108 small classes, 101 regular-sized classes, and 99 regular-sized classes 
with an aide. Over all 4 years of the study, a total of approximately 11,600 students 
at 80 schools participated in the study. 


Deviations from the experimental design. The experimental protocol specified that the 
students should not switch between class groups except through the re-randomization 
at the beginning of first grade. However, approximately 10% of the students switched 
in subsequent years for reasons including incompatible children and behavioral 
problems. These switches represent a departure from the randomization scheme and,
depending on the true nature of the switches, have the potential to introduce bias
into the results. Switches made purely to avoid personality conflicts might be suffi- 
ciently unrelated to the experiment that they would not introduce bias. If, however, 
the switches arose because the parents most concerned with their children’s educa- 
tion pressured the school into switching a child into a small class, then this failure to 
follow the experimental protocol could bias the results toward overstating the effec- 
tiveness of small classes. Another deviation from the experimental protocol was that 
the class sizes changed over time because students switched between classes and 
moved in and out of the school district. 


Analysis of the STAR Data 


Because there are two treatment groups (small class and regular-sized class with an
aide), the regression version of the differences estimator needs to be modified to
handle the two treatment groups and the control group. This modification is done by
introducing two binary variables, one indicating whether the student is in a small
class and another indicating whether the student is in a regular-sized class with an
aide, which leads to the population regression model


$Y_i = \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + u_i,$ (13.3)


where $Y_i$ is a test score, $SmallClass_i = 1$ if the $i$th student is in a small class and $= 0$
otherwise, and $RegAide_i = 1$ if the $i$th student is in a regular class with an aide
and $= 0$ otherwise. The effect on the test score of a small class relative to a regular
class is $\beta_1$, and the effect of a regular class with an aide relative to a regular class is $\beta_2$.
The differences estimator for the experiment can be computed by estimating $\beta_1$ and
$\beta_2$ in Equation (13.3) by OLS.
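
To see why the coefficients in Equation (13.3) are differences in group means, it helps to write out the conditional expectation of the test score for each of the three groups. The short derivation below is a sketch that assumes random assignment, so that $E(u_i \mid SmallClass_i, RegAide_i) = 0$, and uses the fact that no student is in both treatment groups at once.

```latex
\begin{align*}
E(Y_i \mid SmallClass_i = 0,\ RegAide_i = 0) &= \beta_0, \\
E(Y_i \mid SmallClass_i = 1,\ RegAide_i = 0) &= \beta_0 + \beta_1, \\
E(Y_i \mid SmallClass_i = 0,\ RegAide_i = 1) &= \beta_0 + \beta_2,
\end{align*}
so that
\begin{align*}
\beta_1 &= E(Y_i \mid SmallClass_i = 1,\ RegAide_i = 0) - E(Y_i \mid SmallClass_i = 0,\ RegAide_i = 0), \\
\beta_2 &= E(Y_i \mid SmallClass_i = 0,\ RegAide_i = 1) - E(Y_i \mid SmallClass_i = 0,\ RegAide_i = 0).
\end{align*}
```

The intercept is therefore the mean score in regular classes without an aide, and each slope is the gap between one treatment group and that control group.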


Conditional Cash Transfers in Rural Mexico to Increase School Enrollment


In 1997, a program was devised that would give money to poor mothers in rural Mexico on the
condition that their children were enrolled in school. Importantly, the allocation of these
conditional cash transfers was conducted in a way that meant that the short-term impact of the
program on enrollment could be analyzed effectively. Determining the allocation began by
identifying 495 poor rural communities. Then, a census was conducted covering every household
within these communities. On the basis of this, households were divided into those eligible
for the conditional cash transfers and those not eligible. Finally, the transfers were
allocated to all eligible households, but only within 314 of the original 495 communities
following a random selection process. This meant that the treatment, the conditional cash
transfer, was randomly allocated across communities and that the 181 communities not selected
for conditional cash transfers could be used as the control group for the 314 randomly
selected communities. Econometric analysis was able to show that the randomization had been
successful in creating balanced treatment and control groups and that the intervention was
successful in increasing enrollment in school-age children.


TABLE 13.1  Project STAR: Differences Estimates of Effect on Standardized Test Scores of Class Size Treatment Group

                                                       Grade
Regressor                        K                1                 2                3
Small class                      13.90            29.78             19.39            15.59
                                 (4.23)           (4.79)            (5.12)           (4.21)
                                 [5.48, 22.32]    [20.24, 39.32]    [9.18, 29.61]    [7.21, 23.97]
Regular-sized class with aide    0.31             11.96             3.48             -0.29
                                 (3.77)           (4.87)            (4.91)           (4.04)
                                 [-7.19, 7.82]    [2.27, 21.65]     [-6.31, 13.27]   [-8.35, 7.77]
Intercept                        918.04           1039.39           1157.81          1228.51
                                 (4.82)           (5.82)            (5.29)           (4.66)
Number of observations           5786             6379              6049             5967

The regressions were estimated using the Project STAR public access data set described in Appendix 13.1. The dependent
variable is the student’s combined score on the math and reading portions of the Stanford Achievement Test. Standard errors,
clustered at the school level, appear in parentheses, and 95% confidence intervals appear in brackets.


Because of the design of the experiment, the observations are not plausibly i.i.d. 
In particular, once a school is chosen, all students at the school participate. Because 
students at a given school typically come from the same area, they can share similar 
unobserved characteristics, such as parental education. Thus, the error term $u_i$ in
Equation (13.3) could be correlated across students in the same school. While this 
correlation does not lead to bias, the standard errors need to be computed in a way 
that allows for this correlation. Because clustered standard errors allow for correlation 
within entities (schools) but not across entities (see Section 10.5 and Appendix 10.2), 
we compute standard errors clustered at the school level. 
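
The effect of clustering on the standard error can be sketched for the single-regressor case. Everything in the example below is an assumption made for illustration and not a description of the STAR data: 80 simulated schools, four classes of 20 students per school (two small, two regular), a school-level shock, a class-level shock shared by classmates, and a true effect of 3 points. The function computes the heteroskedasticity-robust and cluster-robust variances of the OLS slope without small-sample adjustments, so statistical software may report slightly different values.

```python
import random

random.seed(6)

def slope_and_standard_errors(x, y, cluster):
    """OLS slope of y on x (with intercept), with heteroskedasticity-robust and
    cluster-robust standard errors (no small-sample adjustments)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    # Each score is (X_i - X_bar) times the OLS residual.
    scores = [(xi - x_bar) * (yi - intercept - slope * xi) for xi, yi in zip(x, y)]
    robust_var = sum(s ** 2 for s in scores) / s_xx ** 2
    cluster_sums = {}
    for s, g in zip(scores, cluster):
        cluster_sums[g] = cluster_sums.get(g, 0.0) + s     # add up scores within each school
    clustered_var = sum(v ** 2 for v in cluster_sums.values()) / s_xx ** 2
    return slope, robust_var ** 0.5, clustered_var ** 0.5

x, y, school_id = [], [], []
for school in range(80):
    school_shock = random.gauss(0.0, 5.0)                  # e.g., neighborhood characteristics
    for class_index in range(4):
        small = 1.0 if class_index < 2 else 0.0            # two small and two regular classes
        class_shock = random.gauss(0.0, 5.0)               # shared teacher or classroom effect
        for _ in range(20):
            x.append(small)
            y.append(900.0 + 3.0 * small + school_shock + class_shock + random.gauss(0.0, 10.0))
            school_id.append(school)

slope, robust_se, clustered_se = slope_and_standard_errors(x, y, school_id)
print(f"estimate {slope:.2f}, robust SE {robust_se:.2f}, school-clustered SE {clustered_se:.2f}")
```

In this simulated design the clustered standard error should be clearly larger than the unclustered one, because students who share a class also share a shock, and treating their errors as independent overstates the amount of information in the sample.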

Table 13.1 presents the differences estimates of the effect on test scores of being 
in a small class or in a regular-sized class with an aide. The dependent variable $Y_i$ in
the regressions in Table 13.1 is the student’s total score on the combined math and 
reading portions of the Stanford Achievement Test. According to the estimates in 
Table 13.1, for students in kindergarten, the effect of being in a small class is an 
increase of 13.9 points on the test, relative to being in a regular class; the estimated 
effect of being in a regular class with an aide is only 0.31 points on the test. For each 
grade, the null hypothesis that small classes provide no improvement is rejected at 
the 0.5% (two-sided) significance level. However, it is not possible to reject the null 
hypothesis that having an aide in a regular class provides no improvement, relative 
to not having an aide, except in first grade, even at the 10% significance level. The 
estimated magnitudes of the improvements in small classes are broadly similar in 
grades K, 2, and 3, although the estimate is larger for first grade. 
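
The hypothesis tests described above can be reproduced approximately from the estimates and standard errors reported in Table 13.1. The sketch below computes each t-statistic as the estimate divided by its clustered standard error and converts it to a two-sided p-value using the standard normal distribution. The normal approximation is an assumption of this sketch: the bracketed intervals in the table appear to be based on a critical value slightly above 1.96, so the p-values should be read as approximate.

```python
from statistics import NormalDist

# (estimate, school-clustered standard error) by grade, copied from Table 13.1.
table_13_1 = {
    "small class": {
        "K": (13.90, 4.23), "1": (29.78, 4.79), "2": (19.39, 5.12), "3": (15.59, 4.21),
    },
    "regular class with aide": {
        "K": (0.31, 3.77), "1": (11.96, 4.87), "2": (3.48, 4.91), "3": (-0.29, 4.04),
    },
}

def two_sided_p_value(t_stat):
    """Two-sided p-value from the standard normal distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))

results = {}
for treatment, by_grade in table_13_1.items():
    for grade, (estimate, std_err) in by_grade.items():
        t_stat = estimate / std_err
        results[(treatment, grade)] = (t_stat, two_sided_p_value(t_stat))
        print(f"{treatment:24s} grade {grade}: t = {t_stat:6.2f}, "
              f"p = {results[(treatment, grade)][1]:.4f}")
```

The small-class t-statistics all exceed 3 in absolute value, while the aide coefficient has a t-statistic above conventional critical values only in first grade, which matches the pattern of rejections described in the text.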

The differences estimates in Table 13.1 suggest that reducing class size has an effect 
on test performance, but that adding an aide to a regular-sized class has a much smaller 
effect, possibly 0. As discussed in Section 13.1, augmenting the regressions in Table 13.1 
with additional regressors—the W regressors in Equation (13.2)—can provide more 
efficient estimates of the causal effects. Moreover, if the treatment received is not ran- 
dom because of failures to follow the treatment protocol, then the estimates of the 
experimental effects based on regressions with additional regressors could differ from 
TABLE 13.2 Project STAR: Differences Estimates with Additional Regressors for Kindergarten

Regressor                          (1)              (2)              (3)              (4)
Small class                        13.90            14.00            15.93            15.89
                                   (4.23)           (4.25)           (4.08)           (3.95)
                                   [5.48, 22.32]    [5.55, 22.46]    [7.81, 24.06]    [8.03, 23.74]
Regular-sized class with aide      0.31             −0.60            1.22             1.79
                                   (3.77)           (3.84)           (3.64)           (3.60)
                                   [−7.19, 7.82]    [−8.25, 7.05]    [−6.04, 8.47]    [−5.38, 8.95]
Teacher’s years of experience                       1.47             0.74             0.66
                                                    (0.44)           (0.35)           (0.36)
                                                    [0.60, 2.34]     [0.04, 1.45]     [−0.05, 1.37]
Boy                                                                                   −12.09
                                                                                      (1.54)
Free lunch eligible                                                                   −34.70
                                                                                      (2.47)
Black                                                                                 −25.43
                                                                                      (4.52)
Race other than black or white                                                        −8.50
                                                                                      (12.64)
School indicator variables?        no               no               yes              yes
$R^2$                              0.01             0.02             0.22             0.28
Number of observations             5786             5766             5766             5748

The regressions were estimated using the Project STAR public access data set described in Appendix 13.1. The dependent variable 
is the student’s combined test score on the math and reading portions of the Stanford Achievement Test. All regressions include an 
intercept (not reported). The number of observations differs in the different regressions because of some missing data. Standard 
errors, clustered at the school level, appear in parentheses, and 95% confidence intervals appear in brackets. 


the differences estimates reported in Table 13.1. For these two reasons, estimates of the 
experimental effects in which additional regressors are included in Equation (13.3) are 
reported for kindergarten in Table 13.2; the first column of Table 13.2 repeats the 
results of the first column of Table 13.1, and the remaining three columns include addi- 
tional regressors that measure teacher, school, and student characteristics. 

The main conclusion from Table 13.2 is that the multiple regression estimates of the 
causal effects of the two treatments (small class and regular-sized class with aide) in the 
final three columns of Table 13.2 are similar to the differences estimates reported in 
the first column. That adding these observable regressors does not change the estimated 
causal effects of the different treatments makes it more plausible that the random assign- 
ment to the smaller classes also does not depend on unobserved variables. As expected, 
these additional regressors increase the $R^2$ of the regression, and the standard error of 
the estimated class size effect decreases from 4.23 in column (1) to 3.95 in column (4). 

Because teachers were randomly assigned to class types within a school, the 
experiment also provides an opportunity to estimate the effect on test scores of 
teacher experience. In the terminology of Section 13.1, randomization is conditional 
on the covariates W, where W denotes a full set of binary variables indicating each 
school; that is, W denotes a full set of school fixed effects. Thus, conditional on W, 
years of experience is randomly assigned, which in turn implies that $u_i$ in Equation 
(13.2) satisfies conditional mean independence, where the X variables are the class 
size treatments and the teacher’s years of experience and W is the full set of school 
fixed effects. Because teachers were not reassigned randomly across schools, without 
school fixed effects in the regression [Table 13.2, column (2)] years of experience will, 
in general, be correlated with the error term; for example, wealthier districts might 
have teachers with more years of experience. When school effects are included, the 
estimated coefficient on experience is cut in half, from 1.47 in column (2) of Table 13.2 
to 0.74 in column (3). Because teachers were randomly assigned within a school, 
column (3) produces an unbiased estimator of the effect on test scores of an addi- 
tional year of experience. The estimate, 0.74, is moderately large, although impre- 
cisely estimated: Ten years of experience corresponds to a predicted increase in test 
scores of 7.4 points, with a 95% confidence interval of (0.4, 14.5). 

It is tempting to interpret some of the other coefficients in Table 13.2 but, like 
coefficients on control variables generally, those coefficients do not have a causal 
interpretation. 


Interpreting the estimated effects of class size. Are the estimated effects of class size 
reported in Tables 13.1 and 13.2 large or small in a practical sense? There are two 
ways to answer this: first, by translating the estimated changes in raw test scores into 
units of standard deviations of test scores, so that the estimates in Table 13.1 are 
comparable across grades; and, second, by comparing the estimated class size effect 
to the other coefficients in Table 13.2. 

Because the distribution of test scores is not the same for each grade, the esti- 
mated effects in Table 13.1 are not directly comparable across grades. We faced this 
problem in Section 9.4, when we wanted to compare the effect on test scores of a 
reduction in the student-teacher ratio estimated using data from California to the 
effect estimated using data from Massachusetts. Because the two tests differed, the 
coefficients could not be compared directly. The solution in Section 9.4 was to trans- 
late the estimated effects into units of standard deviations of the test, so that a unit 
decrease in the student-teacher ratio corresponds to a change of an estimated frac- 
tion of a standard deviation of test scores. We adopt this approach here so that the 
estimated effects in Table 13.1 can be compared across grades. For example, the stan- 
dard deviation of test scores for children in kindergarten is 73.75, so the effect of 
being in a small class in kindergarten, based on the estimate in Table 13.1, is 
13.9/73.75 = 0.19, with a standard error of 4.23/73.75 = 0.06. 
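
The conversion to standard deviation units is simple arithmetic, and a short sketch can make it concrete. The Python snippet below reproduces the kindergarten figures using only numbers quoted in the text (the estimates and standard errors from Table 13.1 and the sample standard deviation of 73.75). The function name `to_sd_units` is ours and is introduced purely for illustration.

```python
def to_sd_units(estimate, std_error, sd_test_scores):
    """Express an estimated effect and its standard error in units of
    the standard deviation of test scores."""
    return estimate / sd_test_scores, std_error / sd_test_scores


# Kindergarten values quoted in the text.
SD_KINDERGARTEN = 73.75
small_effect, small_se = to_sd_units(13.90, 4.23, SD_KINDERGARTEN)
aide_effect, aide_se = to_sd_units(0.31, 3.77, SD_KINDERGARTEN)

print(f"Small class:       {small_effect:.2f} ({small_se:.2f})")
print(f"Regular with aide: {aide_effect:.2f} ({aide_se:.2f})")
```

Dividing both the estimate and its standard error by the same constant leaves the t-statistic unchanged, so the rescaling affects only the units, not the statistical significance.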

The estimated effects of class size from Table 13.1, converted into units of the 
standard deviation of test scores across students, are summarized in Table 13.3. 
Expressed in standard deviation units, the estimated effect of being in a small class 
is similar for grades K, 2, and 3 and is approximately one-fifth of a standard deviation 
of test scores. Similarly, the result of being in a regular-sized class with an aide is 
approximately 0 for grades K, 2, and 3. The estimated treatment effects are larger for 
first grade; however, the estimated difference between the small class and the regular- 
sized class with an aide is 0.20 for first grade, the same as for the other grades. Thus 
one interpretation of the first-grade results is that the students in the control group 
(the regular-sized class without an aide) happened to do poorly on the test that year 
for some unusual reason, perhaps simply random sampling variation. 
TABLE 13.3 Estimated Class Size Effects in Units of Standard Deviations 
of the Test Score Across Students 

                                              Grade
Treatment Group                      K        1        2        3
Small class                          0.19     0.33     0.23     0.21
                                     (0.06)   (0.05)   (0.06)   (0.06)
Regular-sized class with aide        0.00     0.13     0.04     0.00
                                     (0.05)   (0.05)   (0.06)   (0.06)
Sample standard deviation
of test scores ($s_Y$)               73.75    91.25    84.08    73.27

The estimates and standard errors in the first two rows are the estimated effects in Table 13.1, divided 
by the sample standard deviation of the Stanford Achievement Test for that grade (the final row in this 
table), computed using data on the students in the experiment. Standard errors, clustered at the school 
level, appear in parentheses. 


Another way to gauge the magnitude of the estimated effect of being in a small 
class is to compare the estimated treatment effects with the other coefficients in 
Table 13.2. In kindergarten, the estimated effect of being in a small class is 13.9 points 
on the test (first row of Table 13.2). Holding constant race, teacher’s years of experience, 
eligibility for free lunch, and the treatment group, boys score lower on the standardized 
test than girls by approximately 12 points, according to the estimates in column (4) of 
Table 13.2. Thus the estimated effect of being in a small class is somewhat larger than 
the performance gap between girls and boys. As another comparison, the estimated 
coefficient on the teacher’s years of experience in column (4) is 0.66, so having a teacher 
with 20 years of experience is estimated to improve test performance by 13 points. Thus 
the estimated effect of being in a small class is approximately the same as the effect of 
having a 20-year veteran as a teacher relative to having a new teacher. These compari- 
sons suggest that the estimated effect of being in a small class is meaningfully large. 


Additional results. Econometricians, statisticians, and specialists in elementary educa- 
tion have studied this experiment extensively, and we briefly summarize some of their 
findings here. One is that the effect of a small class is concentrated in the earliest 
grades, as can be seen in Table 13.3; except for the anomalous first-grade results, the 
test score gap between regular-sized and small classes reported in Table 13.3 is essen- 
tially constant across grades (0.19 standard deviation units in kindergarten, 0.23 in 
second grade, and 0.21 in third grade). Because the children initially assigned to a small 
class stayed in that small class, staying in a small class did not result in additional gains; 
rather, the gains made upon initial assignment were retained in the higher grades, but 
the gap between the treatment and control groups did not increase. Another finding 
is that, as indicated in the second row of Table 13.3, this experiment shows little benefit 
of having an aide in a regular-sized classroom. One potential concern about interpret- 
ing the results of the experiment is the failure to follow the treatment protocol for 
some students (some students switched from the small classes). If initial placement in 
a kindergarten classroom is random and has no direct effect on test scores, then initial 
placement can be used as an instrumental variable that partially, but not entirely, 
influences placement. This strategy was pursued by Krueger (1999), who used two 
stage least squares (TSLS) to estimate the effect on test scores of class size using initial 
classroom placement as the instrumental variable; he found that the TSLS and OLS 
estimates were similar, leading him to conclude that deviations from the experimental 
protocol did not introduce substantial bias into the OLS estimates. An external valid- 
ity concern about all these results is that they pertain to a narrow measure, test scores 
at young ages. Chetty et al. (2011) used tax data to examine long-term outcomes for 
the students in the STAR experiment. Strikingly, they found that students randomly 
assigned to the small class in kindergarten had higher rates of college attendance than 
their peers randomly assigned to a regular-sized class.² 


Comparison of the Observational and Experimental 
Estimates of Class Size Effects 


The Project STAR experiment provides an opportunity that is rare in economics to 
compare an experimental estimate of a causal effect to estimates made using observa- 
tional data. Part II presented multiple regression estimates of the class size effect based 
on observational data for California and Massachusetts school districts. In those data, 
class size was not randomly assigned but instead was determined by local school offi- 
cials trying to balance educational objectives against budgetary realities. How do those 
observational estimates compare with the experimental estimates from Project STAR? 

To compare the California and Massachusetts estimates with those in Table 13.3, 
it is necessary to consider the same class size reduction and to express the predicted 
effect in comparable units, such as standard deviations of test scores. Over the four 
years of the STAR experiment, the small classes had, on average, approximately 7.5 
fewer students than the regular-sized classes, so we use the observational estimates 
to predict the effect on test scores of a reduction of 7.5 students per class. Based on 
the OLS estimates for the linear specifications summarized in the first column of 
Table 9.3, the California estimates predict an increase of 5.5 points on the test for a 
7.5-student reduction in the student-teacher ratio (0.73 × 7.5 = 5.5 points). The 
standard deviation of the test across students in California is approximately 38 points, 
so the estimated effect of the reduction of 7.5 students, expressed in units of standard 
deviations across students, is 5.5/38 = 0.14 standard deviations.³ The standard error 
of the estimated slope coefficient for California is 0.26 (Table 9.3), so the standard 
error of the estimated effect of a 7.5-student reduction in standard deviation units is 


²For further reading about Project STAR, see Mosteller (1995), Mosteller, Light, and Sachs (1996), and 
Krueger (1999). Ehrenberg et al. (2001a, 2001b) discuss Project STAR and place it in the context of the 
policy debate on class size and related research on the topic. For some criticisms of Project STAR, see 
Hanushek (1999a), and for a critical view of the relationship between class size and performance more 
generally, see Hanushek (1999b). 


³In Table 9.3, the estimated effects are presented in terms of the standard deviation of test scores across 
districts; in Table 13.3, the estimated effects are presented in terms of the standard deviation of test 
scores across students. The standard deviation across students is greater than the standard deviation across 
districts. For California, the standard deviation across students is 38, but the standard deviation across 
districts is 19.1. 
TABLE 13.4 Estimated Effects of Reducing the Student-Teacher Ratio by 7.5 Based on the STAR 
Data and the California and Massachusetts Observational Data 

                                 Change in                  Standard Deviation
                                 Student-Teacher            of Test Scores        Estimated   95% Confidence
Study            $\hat{\beta}_1$ Ratio                      Across Students       Effect      Interval
STAR (grade K)   −13.90          Small class vs.            73.8                  0.19        [0.08, 0.30]
                 (4.23)          regular-sized class                              (0.06)
California       −0.73           −7.5                       38.0                  0.14        [0.04, 0.24]
                 (0.26)                                                           (0.05)
Massachusetts    −0.64           −7.5                       39.0                  0.12        [0.02, 0.22]
                 (0.27)                                                           (0.05)

The estimated coefficient $\hat{\beta}_1$ for the STAR study is taken from column (1) of Table 13.2. The estimated coefficients for the 
California and Massachusetts studies are taken from the first column of Table 9.3. The estimated effect is the effect of being in 
a small class versus a regular-sized class (for STAR) or the effect of reducing the student-teacher ratio by 7.5 (for the California 
and Massachusetts studies). The 95% confidence interval for the reduction in the student-teacher ratio is this estimated effect 
± 1.96 standard errors. Standard errors are given in parentheses under estimated effects. 


0.26 × 7.5/38 = 0.05. Thus, based on the California data, the estimated effect of 
reducing classes by 7.5 students, expressed in units of standard deviations of test 
scores across students, is 0.14 standard deviations, with a standard error of 0.05. These 
calculations and similar calculations for Massachusetts are summarized in Table 13.4, 
along with the STAR estimates for kindergarten taken from column (1) of Table 13.2. 
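
The arithmetic behind the observational rows of Table 13.4 can be checked with a few lines of code. The sketch below is illustrative only: it uses the slope coefficients, standard errors, and student-level standard deviations quoted in the text and in Table 13.4, and the helper name `effect_in_sd_units` is our own. The confidence interval is formed as the estimated effect ± 1.96 standard errors, as stated in the table notes.

```python
def effect_in_sd_units(slope, slope_se, change_in_ratio, sd_students, z=1.96):
    """Predicted effect of changing the student-teacher ratio, expressed in
    standard deviations of test scores across students."""
    effect = slope * change_in_ratio / sd_students
    std_error = slope_se * abs(change_in_ratio) / sd_students
    interval = (effect - z * std_error, effect + z * std_error)
    return effect, std_error, interval


# (slope, standard error of slope, SD of test scores across students)
studies = {
    "California": (-0.73, 0.26, 38.0),
    "Massachusetts": (-0.64, 0.27, 39.0),
}

results = {
    name: effect_in_sd_units(slope, se, -7.5, sd)
    for name, (slope, se, sd) in studies.items()
}

for name, (effect, se, (low, high)) in results.items():
    print(f"{name}: {effect:.2f} ({se:.2f}), 95% CI [{low:.2f}, {high:.2f}]")
```

Because the slope is negative and the change in the ratio is a reduction of 7.5, the product is positive: smaller classes are predicted to raise test scores.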

The estimated effects from the California and Massachusetts observational stud- 
ies are somewhat smaller than the STAR estimates. One reason that estimates from 
different studies differ, however, is random sampling variability, so it makes sense to 
compare confidence intervals for the estimated effects from the three studies. Based 
on the STAR data for kindergarten, the 95% confidence interval for the effect of 
being in a small class (reported in the final column of Table 13.4) is 0.08 to 0.30. The 
comparable 95% confidence interval based on the California observational data is 
0.04 to 0.24, and for Massachusetts, it is 0.02 to 0.22. Thus the 95% confidence inter- 
vals from the California and Massachusetts studies contain most of the 95% confi- 
dence interval from the STAR kindergarten data. Viewed in this way, the three 
studies give strikingly similar ranges of estimates. 

There are many reasons the experimental and observational estimates might 
differ. One reason is that, as discussed in Section 9.4, there are remaining threats to 
the internal validity of the observational studies. For example, because children move 
into and out of districts, the district student-teacher ratio might not reflect the 
student-teacher ratio actually experienced by the students, so the coefficient on the 
student-teacher ratio in the Massachusetts and California studies could be biased 
toward 0 because of errors-in-variables bias. In addition, the district average 
student-teacher ratio used in the observational studies is not the same thing as the 
number of children actually in a class, the STAR experimental variable. Other reasons 
concern external validity. Project STAR was conducted in a southern state in 
the 1980s, potentially different from California and Massachusetts in the late 1990s, 
and the grades being compared differ (K through 3 in STAR, fourth grade in 
Massachusetts, and fifth grade in California). In light of all these reasons to expect 
different estimates, the findings of the three studies are remarkably similar. That the 
estimates from the observational studies are similar to the Project STAR estimates 
suggests that the remaining threats to the internal validity of the observational 
estimates are minor. 


13.4 Quasi-Experiments 


The statistical insights and methods of randomized controlled experiments can carry 
over to nonexperimental settings. In a quasi-experiment, also called a natural 
experiment, randomness is introduced by variations in individual circumstances that 
make it appear as if the treatment is randomly assigned. These variations in indi- 
vidual circumstances might arise because of vagaries in legal institutions, location, 
timing of policy or program implementation, natural randomness such as birth dates, 
rainfall, or other factors that are unrelated to the causal effect under study. 

We consider two types of quasi-experiments. In the first, whether an individual 
(more generally, an entity) receives treatment is viewed as if it is randomly deter- 
mined. In this case, the causal effect can be estimated by OLS using the treatment, 
$X_i$, as a regressor. In the second type of quasi-experiment, the as-if random variation 
only partially determines the treatment. In this case, the causal effect is estimated by 
instrumental variables regression, where the as-if random source of variation pro- 
vides the instrumental variable. 

After providing some examples, this section presents some extensions of the 
econometric methods in Sections 13.1 and 13.2 that can be useful for analyzing data 
from quasi-experiments. 


Examples 


We illustrate these two types of quasi-experiments by examples. The first example is 
a quasi-experiment in which the treatment is as-if randomly determined. The second 
and third examples illustrate quasi-experiments in which the as-if random variation 
influences, but does not entirely determine, the level of the treatment. 


Example 1: Labor market effects of immigration. Does immigration reduce wages? 
Economic theory suggests that if the supply of labor increases because of an influx 
of immigrants, the “price” of labor—the wage—should fall. However, all else being 
equal, immigrants are attracted to cities with high labor demand, so the OLS estima- 
tor of the effect on wages of immigration will be biased. An ideal randomized con- 
trolled experiment for estimating the effect on wages of immigration would randomly 
assign different numbers of immigrants (different “treatments”) to different labor 
markets (“subjects”) and measure the effect on wages (the “outcome”). Such an 
experiment, however, faces severe practical, financial, and ethical problems. 




The labor economist David Card (1990) therefore used a quasi-experiment in 
which a large number of Cuban immigrants entered the Miami, Florida, labor market 
in the Mariel boatlift, which resulted from a temporary lifting of restrictions on emi- 
gration from Cuba in 1980. Half of the immigrants settled in Miami, in part because 
it had a large preexisting Cuban community. Card estimated the causal effect on 
wages of an increase in immigration by comparing the change in wages of low-skilled 
workers in Miami to the change in wages of similar workers in comparable U.S. cities 
over the same period. He concluded that this influx of immigrants had a negligible 
effect on wages of less-skilled workers. 


Example 2: Effects of class size on educational achievement. Experiments such as 
Project STAR, discussed in Section 13.3, are rare. Moreover, the results of such 
an experiment may not be considered generalizable beyond the study itself. We have 
already discussed whether the results from Tennessee in the 1980s could be generaliz- 
able to California and Massachusetts in the late 1990s, but what about the generaliz- 
ability of its results to other countries at different points in time? 

This particular research question is universally policy relevant, which means that 
countries across the world will want to answer this question for their context to make 
evidence-based educational policy. Similarly, the research question presents chal- 
lenges to econometric analysis in most countries. Urquiola (2006) analyzes this ques- 
tion in the context of rural Bolivia, using a regression discontinuity design similar to 
that employed by Angrist and Lavy (1999) in a study set within Israel. The use of a 
quasi-experimental design is justified on the basis that enrollment can be positively 
related to both class size and socioeconomic status, which would result in a bias 
toward finding a positive link between class size and educational achievement. The 
discontinuity in the Bolivian data arises from a regulation that 
allows schools to obtain an additional teacher if there are more than 30 students in a 
given grade. However, the discontinuity is not hard and fast because in practice some 
schools with over 30 students per grade do not obtain an additional teacher. It 
therefore represents a “fuzzy” discontinuity. The results of this econometric strategy reveal 
a substantial estimated negative effect of class size, somewhat larger in magnitude 
than the effect estimated in other contexts: “a 1-standard-deviation reduction in class 
size (approximately eight students) raises scores by up to 0.3 standard deviations.”⁴ 


Example 3: The effect of cardiac catheterization. Section 12.5 described the study 
by McClellan, McNeil, and Newhouse (1994), in which they used the distance from a 
heart attack patient’s home to a cardiac catheterization hospital, relative to the dis- 
tance to a hospital lacking catheterization facilities, as an instrumental variable for 
actual treatment by cardiac catheterization. This study is a quasi-experiment with a 
variable that partially determines the treatment. The treatment itself, cardiac cathe- 
terization, is determined by personal characteristics of the patient and by the deci- 
sion of the patient and doctor; however, it is also influenced by whether a nearby 
hospital is capable of performing this procedure. If the location of the patient is as-if 
randomly assigned and has no direct effect on health outcomes other than through 
its effect on the probability of catheterization, then the relative distance to a 
catheterization hospital is a valid instrumental variable. 


⁴The MIT Press Journals, Miguel Urquiola, Identifying Class Size Effects in Developing Countries: 
Evidence from Rural Bolivia, March 29, 2006. 


The Differences-in-Differences Estimator 


If the treatment in a quasi-experiment is as-if randomly assigned, conditional on 
some observed variables W, then the treatment effect can be estimated using the 
differences regression in Equation (13.2). Because the researcher does not have con- 
trol over the randomization, however, some differences might remain between the 
treatment and control groups even after controlling for W. One way to adjust for 
those remaining differences between the two groups is to compare not the outcomes 
Y but the change in the outcomes pre- and posttreatment, thereby adjusting for 
differences in pretreatment values of Y in the two groups. Because this estimator is 
the difference across groups in the change, or difference over time, it is called the 
differences-in-differences estimator. For example, in his study of the effect of immi- 
gration on low-skilled workers’ wages, Card (1990) used a differences-in-differences 
estimator to compare the change in wages in Miami with the change in wages in other 
U.S. cities. 


The differences-in-differences estimator. Let $\bar{Y}^{\text{treatment, before}}$ be the sample average 
of $Y$ for those in the treatment group before the experiment, and let $\bar{Y}^{\text{treatment, after}}$ be 
the sample average for the treatment group after the experiment. Let $\bar{Y}^{\text{control, before}}$ 
and $\bar{Y}^{\text{control, after}}$ be the corresponding pretreatment and posttreatment sample 
averages for the control group. The average change in $Y$ over the course of the 
experiment for those in the treatment group is $\bar{Y}^{\text{treatment, after}} - \bar{Y}^{\text{treatment, before}}$, 
and the average change in $Y$ over this period for those in the control group is 
$\bar{Y}^{\text{control, after}} - \bar{Y}^{\text{control, before}}$. The differences-in-differences estimator is the average 
change in $Y$ for those in the treatment group minus the average change in $Y$ for those 
in the control group: 

$$\hat{\beta}_1^{\text{diffs-in-diffs}} = \left(\bar{Y}^{\text{treatment, after}} - \bar{Y}^{\text{treatment, before}}\right) - \left(\bar{Y}^{\text{control, after}} - \bar{Y}^{\text{control, before}}\right) = \Delta\bar{Y}^{\text{treatment}} - \Delta\bar{Y}^{\text{control}}, \qquad (13.4)$$

where $\Delta\bar{Y}^{\text{treatment}}$ is the average change in $Y$ in the treatment group and $\Delta\bar{Y}^{\text{control}}$ is 
the average change in $Y$ in the control group. If the treatment is randomly assigned, 
then $\hat{\beta}_1^{\text{diffs-in-diffs}}$ is an unbiased and consistent estimator of the causal effect. 

The differences-in-differences estimator can be written in regression notation. 
Let $\Delta Y_i$ be the postexperimental value of $Y$ for the $i$th individual minus the 
preexperimental value. The differences-in-differences estimator is the OLS estimator 
of $\beta_1$ in the regression 

$$\Delta Y_i = \beta_0 + \beta_1 X_i + u_i. \qquad (13.5)$$
FIGURE 13.1 The Differences-in-Differences Estimator 

[Figure: the outcome (vertical axis) is plotted against the time period (horizontal axis, t = 1 and t = 2) for the treatment and control groups.] 

The posttreatment difference between the treatment and control groups is 
80 − 30 = 50, but this overstates the treatment effect because before the 
treatment $Y$ was higher for the treatment group than the control group 
by 40 − 20 = 20. The differences-in-differences estimator is the difference 
between the final and initial gaps, so $\hat{\beta}_1^{\text{diffs-in-diffs}}$ = (80 − 30) − (40 − 20) = 50 − 20 = 30. 
Equivalently, the differences-in-differences estimator is the average change for the 
treatment group minus the average change for the control group; that is, 
$\hat{\beta}_1^{\text{diffs-in-diffs}} = \Delta\bar{Y}^{\text{treatment}} - \Delta\bar{Y}^{\text{control}}$ = (80 − 40) − (30 − 20) = 30. 


The differences-in-differences estimator is illustrated in Figure 13.1. In that 
figure, the sample average of Y for the treatment group is 40 before the experiment, 
whereas the pretreatment sample average of Y for the control group is 20. Over the 
course of the experiment, the sample average of Y increases in the control group to 
30, whereas it increases to 80 for the treatment group. Thus the mean difference of 
the posttreatment sample averages is 80 − 30 = 50. However, some of this difference 
arises because the treatment and control groups had different pretreatment 
means: The treatment group started out ahead of the control group. The 
differences-in-differences estimator measures the gains of the treatment group relative to the 
control group, which in this example is (80 − 40) − (30 − 20) = 30. By focusing 
on the change in Y over the course of the experiment, the differences-in-differences 
estimator removes the influence of initial values of Y that vary between the treat- 
ment and control groups. 
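
The numerical example can be checked directly, both from the four sample averages and from a regression of the change in the outcome on the treatment indicator. The sketch below is illustrative: the four group averages come from the example in the text, while the six individual changes in `delta_y` are hypothetical values we chose so that the group averages of the changes are 40 and 10.

```python
import numpy as np

# Sample averages from the Figure 13.1 example.
y_bar = {
    ("treatment", "before"): 40.0,
    ("treatment", "after"): 80.0,
    ("control", "before"): 20.0,
    ("control", "after"): 30.0,
}
change_treatment = y_bar[("treatment", "after")] - y_bar[("treatment", "before")]
change_control = y_bar[("control", "after")] - y_bar[("control", "before")]
did_from_means = change_treatment - change_control

# Hypothetical individual-level changes in Y: three treated individuals
# (average change 40) followed by three controls (average change 10).
delta_y = np.array([38.0, 40.0, 42.0, 9.0, 10.0, 11.0])
x = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# OLS of the change in Y on an intercept and the treatment indicator.
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, delta_y, rcond=None)
did_from_regression = coef[1]

print(did_from_means, round(did_from_regression, 6))
```

With a single binary regressor, the OLS slope equals the difference in group means of the dependent variable, which is why the regression and the four-means calculation agree.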


The differences-in-differences estimator with additional regressors. The 
differences-in-differences estimator can be extended to include additional regressors 
$W_{1i}, \ldots, W_{ri}$. These variables can be individual characteristics prior to the 
experiment, or they can be control variables. These additional regressors can be 
incorporated using the multiple regression model 

$$\Delta Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \cdots + \beta_{1+r} W_{ri} + u_i. \qquad (13.6)$$

The OLS estimator of $\beta_1$ in Equation (13.6) is the differences-in-differences estimator 
with additional regressors. If $X_i$ is as-if randomly assigned, conditional on $W_{1i}, \ldots, W_{ri}$, 
then $u_i$ satisfies conditional mean independence, and the OLS estimator of $\beta_1$ in 
Equation (13.6) is unbiased. 

The differences-in-differences estimator described here considers two time peri- 
ods, before and after the experiment. In some settings, there are panel data with 
multiple time periods. The differences-in-differences estimator can be extended to 
multiple time periods using the panel data regression methods of Chapter 10. 


Differences-in-differences using repeated cross-sectional data. A repeated 
cross-sectional data set is a collection of cross-sectional data sets, where each cross- 
sectional data set corresponds to a different time period. For example, the data set 
might contain observations on 400 individuals in the year 2004 and on 500 different 
individuals in 2005, for a total of 900 different individuals. One example of repeated 
cross-sectional data is political polling data, in which political preferences are mea- 
sured by a series of surveys of randomly selected potential voters, where the surveys 
are taken at different dates and each survey has different respondents. 

The premise of using repeated cross-sectional data is that if the individuals (more 
generally, entities) are randomly drawn from the same population, then the individu- 
als in the earlier cross section can be used as surrogates for the individuals in the 
treatment and control groups in the later cross section. 

When there are two time periods, the regression model for repeated cross- 
sectional data is 


$$Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_2 G_i + \beta_3 D_t + \beta_4 W_{1it} + \cdots + \beta_{3+r} W_{rit} + u_{it}, \qquad (13.7)$$


where $X_{it}$ is the actual treatment of the $i$th individual (entity) in the cross section in 
period $t$ ($t = 1, 2$), $G_i$ is a binary variable indicating whether the individual is in the 
treatment group (or in the surrogate treatment group if the observation is in the 
pretreatment period), and $D_t$ is the binary indicator that equals 0 in the first period 
and equals 1 in the second period. The $i$th individual receives treatment if he or she 
is in the treatment group in the second period, so in Equation (13.7), $X_{it} = G_i \times D_t$; 
that is, $X_{it}$ is the interaction between $G_i$ and $D_t$. 

If the quasi-experiment makes $X_{it}$ as-if randomly received, conditional on the W’s, 
then the causal effect can be estimated by the OLS estimator of $\beta_1$ in Equation (13.7). 
If there are more than two time periods, then Equation (13.7) is modified to contain 
T — 1 binary variables indicating the different time periods (see Section 10.4). 
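
To make the repeated cross-section regression concrete, the following sketch simulates two cross sections (400 individuals in the first period and 500 different individuals in the second, echoing the example above) and estimates the coefficient on the interaction of the group and period indicators by OLS. Everything about the data-generating process, including the assumed treatment effect of 5, the other coefficients, and the noise level, is hypothetical and chosen only for illustration; no additional W regressors are included.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_EFFECT = 5.0  # assumed causal effect used to generate the data


def draw_cross_section(n, period):
    """Simulate one cross section of n individuals observed in a single period."""
    g = rng.integers(0, 2, size=n).astype(float)  # treatment-group indicator G
    d = np.full(n, float(period))                 # period indicator D (0 or 1)
    x = g * d                                     # treatment received: X = G x D
    y = 20.0 + TRUE_EFFECT * x + 3.0 * g + 2.0 * d + rng.normal(0.0, 1.0, size=n)
    return y, x, g, d


first = draw_cross_section(400, period=0)
second = draw_cross_section(500, period=1)
y, x, g, d = (np.concatenate(pair) for pair in zip(first, second))

# OLS of Y on an intercept, the interaction X, the group indicator, and the
# period indicator; the coefficient on X is the diffs-in-diffs estimate.
design = np.column_stack([np.ones_like(y), x, g, d])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta1_hat = coef[1]

print(f"Estimated treatment effect: {beta1_hat:.2f}")
```

The group indicator absorbs the pretreatment difference between the two groups and the period indicator absorbs the common time trend, so only the interaction term picks up the treatment effect.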


Instrumental Variables Estimators 


If the quasi-experiment yields a variable $Z_i$ that influences receipt of treatment, if data 
are available both on $Z_i$ and on the treatment actually received ($X_i$), and if $Z_i$ is as-if 
randomly assigned (perhaps after controlling for some additional variables $W_i$), then $Z_i$ 
is a valid instrument for $X_i$, and the coefficients of Equation (13.2) can be estimated using 
two stage least squares. Any control variables appearing in Equation (13.2) also appear 
as control variables in the first stage of the two stage least squares estimator of $\beta_1$. 
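
A minimal simulation can illustrate why two stage least squares helps when the as-if random variable only partially determines treatment. In the sketch below, all names and parameter values are hypothetical: `z` plays the role of the as-if randomly assigned instrument, `w` is an observed control that enters both stages, and `v` is an unobserved factor that affects both the treatment and the outcome. The two stages are computed by hand with NumPy rather than with a dedicated econometrics package, so the second-stage OLS standard errors would not be valid and are not reported.

```python
import numpy as np


def ols(y, regressors):
    """OLS coefficients of y on an intercept plus the given regressors."""
    design = np.column_stack([np.ones(len(y))] + list(regressors))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design


rng = np.random.default_rng(1)
n = 5000
TRUE_EFFECT = 2.0  # assumed causal effect of x on y

z = rng.integers(0, 2, size=n).astype(float)  # as-if random instrument
w = rng.normal(size=n)                        # observed control variable
v = rng.normal(size=n)                        # unobserved confounder
x = 1.0 * z + 0.5 * w + 0.8 * v + rng.normal(size=n)  # treatment received
y = 1.0 + TRUE_EFFECT * x + 1.0 * w + 1.5 * v + rng.normal(size=n)

# OLS of y on x and w is biased because x is correlated with v.
ols_coef, _ = ols(y, [x, w])

# First stage: regress x on the instrument and the control.
first_coef, first_design = ols(x, [z, w])
x_hat = first_design @ first_coef

# Second stage: regress y on the first-stage fitted values and the control.
tsls_coef, _ = ols(y, [x_hat, w])

print(f"OLS estimate:  {ols_coef[1]:.2f}")
print(f"TSLS estimate: {tsls_coef[1]:.2f}")
```

In this setup the OLS estimate is pushed above the assumed effect by the unobserved confounder, while the TSLS estimate uses only the variation in treatment induced by the instrument.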


FIGURE 13.2  A Hypothetical Regression Discontinuity Design Scatterplot

[Figure omitted: scatterplot with the population regression line.] Suppose that the
binary treatment X is required if W is less than the threshold value $w_0 = 2$. As long
as the only role of the threshold $w_0$ is to mandate treatment, the treatment effect is
given by the magnitude of the jump, or discontinuity, in the regression function at
$W = 2$.


Regression Discontinuity Estimators 


One situation that gives rise to a quasi-experiment is when receipt of the treatment 
depends in whole or in part on whether an observable variable W crosses a threshold 
value. For example, suppose that students are required to attend summer school if 
their end-of-year grade point average (GPA) falls below a threshold.⁵ Then one way
to estimate the effect of mandatory summer school is to compare outcomes for stu- 
dents whose GPA was just below the threshold (and thus were required to attend) to 
outcomes for students whose GPA was just above the threshold (so they escaped 
summer school). The outcome Y could be next year’s GPA, whether the student 
drops out, or future earnings. As long as there is nothing special about the threshold 
value other than its use in mandating summer school, it is reasonable to attribute any 
jump in outcomes at that threshold to summer school. Figure 13.2 illustrates a hypothetical
scatterplot of a data set in which the treatment (summer school, X) is
required if GPA (W) is less than a threshold value ($w_0 = 2.0$). The scatterplot shows
next year’s GPA (Y) for a hypothetical sample of students as a function of this year’s
GPA, along with the population regression function. If the only role of the threshold
$w_0$ is to mandate summer school, then the jump in next year’s GPA at $w_0$ is an estimate
of the effect of summer school on next year’s GPA.

Because of the jump, or discontinuity, in treatment at the threshold, studies that 
exploit a discontinuity in the probability of receiving treatment at a threshold value 
are called regression discontinuity designs. There are two types of regression discon- 
tinuity designs, sharp and fuzzy. 


5This example is a simplified version of the regression discontinuity study of the effect of summer school
for elementary and middle school students by Jordan Matsudaira (2008), in which summer school atten- 
dance was based in part on end-of-year tests. 
Sharp regression discontinuity design. In a sharp regression discontinuity design,
receipt of treatment is entirely determined by whether W exceeds the threshold: All
students with $W < w_0$ attend summer school, and no students with $W \geq w_0$ attend; that
is, $X_i = 1$ if $W_i < w_0$, and $X_i = 0$ if $W_i \geq w_0$. In this case, the jump in Y at the threshold
equals the average treatment effect for the subpopulation with $W = w_0$, which might be
a useful approximation to the average treatment effect in the larger population of interest.
If the regression function is linear in W, other than for the treatment-induced discontinuity,
the treatment effect can be estimated by $\beta_1$ in the regression:


$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i. \tag{13.8}$$


If the regression function is nonlinear, then a suitable nonlinear function of W can 
be used (Section 8.2). 
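
The sharp design can be illustrated with a small simulation. This is a sketch under assumed values: the coefficients, the uniform distribution of W, and the helper names `solve` and `ols` are all hypothetical. Only the threshold of 2.0 echoes Figure 13.2. The code generates data in which treatment is a deterministic function of W and then estimates Equation (13.8) by OLS, so the coefficient on X recovers the jump at the threshold.

```python
import random


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]          # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """OLS coefficients; each element of rows is [1, regressor, ...]."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


rng = random.Random(1)
W0 = 2.0                                  # threshold, as in Figure 13.2
BETA0, BETA1, BETA2 = 0.5, 0.4, 0.8       # hypothetical population values

rows, y = [], []
for _ in range(4000):
    w = rng.uniform(0.0, 4.0)             # this year's GPA, W_i
    x = 1.0 if w < W0 else 0.0            # sharp rule: treated iff W_i < w_0
    u = rng.gauss(0.0, 0.3)
    rows.append([1.0, x, w])
    y.append(BETA0 + BETA1 * x + BETA2 * w + u)

b0_hat, b1_hat, b2_hat = ols(rows, y)
print(round(b1_hat, 3))                   # estimated jump at the threshold
```

If the true regression function were nonlinear in W, this linear specification could mistake curvature for a jump, which is why the text points to nonlinear functions of W.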


Fuzzy regression discontinuity design. In a fuzzy regression discontinuity design, 
crossing the threshold influences receipt of the treatment but is not the sole determi- 
nant. For example, suppose that some students whose GPA falls below the threshold 
are exempted from summer school while some whose GPA exceeds the threshold 
nevertheless attend. This situation could arise if the threshold rule is part of a more 
complicated process for determining treatment. In a fuzzy design, $X_i$ will, in general,
be correlated with $u_i$ in Equation (13.8). If, however, any special effect of crossing the
threshold operates solely by increasing the probability of treatment (that is, the
direct effect of crossing the threshold is captured by the linear term in W), then an
instrumental variables approach is available. Specifically, let the binary variable $Z_i$
indicate crossing the threshold (so $Z_i = 1$ if $W_i < w_0$ and $Z_i = 0$ if $W_i \geq w_0$). Then
$Z_i$ influences receipt of treatment but is uncorrelated with $u_i$, so it is a valid instrument
for $X_i$. Thus, in a fuzzy regression discontinuity design, $\beta_1$ can be estimated by
instrumental variables estimation of Equation (13.8), using as an instrument the
binary variable indicating that $W_i < w_0$.
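
A rough simulation of the fuzzy case follows. The data-generating process is an assumption made for illustration: an unobserved factor `a` lowers outcomes and raises the chance of attending summer school on either side of the threshold, so X is correlated with the error. The hypothetical helper `moment_estimator` solves (Z'X)b = Z'y. With the instrument list equal to the regressor list this is OLS; with the threshold indicator replacing X in the instrument list it is the just-identified IV estimator, which coincides with TSLS in that case.

```python
import random


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def moment_estimator(x_rows, z_rows, y):
    """Solve (Z'X) b = Z'y: OLS if z_rows is x_rows, just-identified IV otherwise."""
    k = len(x_rows[0])
    zx = [[sum(z[i] * x[j] for z, x in zip(z_rows, x_rows)) for j in range(k)]
          for i in range(k)]
    zy = [sum(z[i] * yi for z, yi in zip(z_rows, y)) for i in range(k)]
    return solve(zx, zy)


rng = random.Random(2)
W0 = 2.0
BETA0, BETA1, BETA2 = 0.5, 0.4, 0.8       # hypothetical population values

x_rows, z_rows, y = [], [], []
for _ in range(20000):
    w = rng.uniform(0.0, 4.0)
    z = 1.0 if w < W0 else 0.0            # instrument Z_i: below the threshold
    a = rng.gauss(0.0, 1.0)               # unobserved factor, part of u_i
    # Fuzzy rule: being below the threshold raises the chance of treatment,
    # but students with low `a` are more likely to attend on either side.
    x = 1.0 if (1.6 * z - 0.8 - a) > 0 else 0.0
    u = 0.5 * a + rng.gauss(0.0, 0.3)
    x_rows.append([1.0, x, w])
    z_rows.append([1.0, z, w])
    y.append(BETA0 + BETA1 * x + BETA2 * w + u)

b1_ols = moment_estimator(x_rows, x_rows, y)[1]   # biased: X correlated with u
b1_iv = moment_estimator(x_rows, z_rows, y)[1]    # uses the threshold indicator
print(round(b1_ols, 3), round(b1_iv, 3))
```

In this setup the OLS coefficient falls well below the assumed effect of 0.4, while the IV estimate stays close to it.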


Potential Problems with Quasi-Experiments 


Like all empirical studies, quasi-experiments face threats to internal and external 
validity. A particularly important potential threat to internal validity is whether the 
as-if randomization, in fact, can be treated reliably as true randomization. 


Threats to Internal Validity 


The threats to the internal validity of true randomized controlled experiments listed 
in Section 13.2 also apply to quasi-experiments but with some modifications. 


Failure of randomization. Quasi-experiments rely on differences in individual 
circumstances—legal changes, sudden unrelated events, and so forth—to provide the 
as-if randomization in the treatment level. If this as-if randomization fails to produce 
a treatment level X (or an instrumental variable Z) that is random, then, in general, the
OLS estimator is biased (or the instrumental variable estimator is not consistent). 

As in a true experiment, one way to test for failure of randomization is to check for 
systematic differences between the treatment and control groups, for example by 
regressing X (or Z) on the individual characteristics (the W’s) and testing the hypothesis 
that the coefficients on the W’s are 0. If differences exist that are not readily explained 
by the nature of the quasi-experiment, then that is evidence that the quasi-experiment 
did not produce true randomization. Even if there is no relationship between X (or Z)
and the W’s, the possibility remains that X (or Z) could be related to some of the unob- 
served factors in the error term u. Because these factors are unobserved, this possibility 
cannot be tested, and the validity of the assumption of as-if randomization must be 
evaluated using expert knowledge and judgment applied to the application at hand. 
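
One simple way to look for systematic differences is sketched below. It is a simpler relative of the regression-based test described above: instead of regressing X on several W’s, it compares the mean of a single pretreatment characteristic across the treatment and control groups with a two-sample t-statistic. The data, sample size, and the function name `balance_t_stat` are hypothetical.

```python
import random
from statistics import mean, variance


def balance_t_stat(w, x):
    """t-statistic for the difference in mean W between treated and controls."""
    w1 = [wi for wi, xi in zip(w, x) if xi == 1]
    w0 = [wi for wi, xi in zip(w, x) if xi == 0]
    se = (variance(w1) / len(w1) + variance(w0) / len(w0)) ** 0.5
    return (mean(w1) - mean(w0)) / se


rng = random.Random(3)
n = 2000
w = [rng.gauss(0.0, 1.0) for _ in range(n)]     # pretreatment characteristic W

# Case 1: treatment unrelated to W (as-if randomization holds).
x_random = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
# Case 2: individuals with higher W are more likely to be treated.
x_selected = [1 if wi + rng.gauss(0.0, 1.0) > 0 else 0 for wi in w]

t_random = balance_t_stat(w, x_random)
t_selected = balance_t_stat(w, x_selected)
print(round(t_random, 2), round(t_selected, 2))
```

A large t-statistic in absolute value, as in the second case, signals that treatment is related to W. A small one does not rule out a relationship with unobserved factors, as the text notes.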


Failure to follow the treatment protocol. In a true experiment, failure to follow the 
treatment protocol arises when members of the treatment group fail to receive treat- 
ment, members of the control group actually receive treatment, or both; in conse- 
quence, the OLS estimator of the causal effect has selection bias. The counterpart to 
failing to follow the treatment protocol in a quasi-experiment is when the as-if ran- 
domization influences, but does not determine, the treatment level. In this case, the 
instrumental variables estimator based on the quasi-experimental influence Z can be 
consistent even though the OLS estimator is not. 


Attrition. Attrition in a quasi-experiment is similar to attrition in a true experiment in 
the sense that if attrition arises because of personal choices or characteristics, then it can 
induce correlation between the treatment level and the error term. The result is sample 
selection bias, so the OLS estimator of the causal effect is biased and inconsistent. 


Experimental effects. An advantage of quasi-experiments is that because they are 
not true experiments, there typically is no reason for individuals to think that they 
are experimental subjects. Thus experimental effects such as the Hawthorne effect 
generally are not germane in quasi-experiments. 


Instrument validity in quasi-experiments. An important step in evaluating a study 
that uses instrumental variables regression is careful consideration of whether the 
instrument is in fact valid. This general statement remains true in quasi-experimental 
studies in which the instrument is as-if randomly determined. As discussed in 
Chapter 12, instrument validity requires both instrument relevance and instrument 
exogeneity. Because instrument relevance can be checked using the statistical meth- 
ods summarized in Key Concept 12.5, here we focus on the second, more judgmental 
requirement of instrument exogeneity. 

Although it might seem that a randomly assigned instrumental variable is neces- 
sarily exogenous, that is not so. Consider the examples of Section 13.4. In Angrist’s 
(1990) use of draft lottery numbers as an instrumental variable in studying the effect 
on civilian earnings of military service, the lottery numbers were, in fact, randomly
assigned. But as Angrist (1990) points out and discusses, if a low draft number results
in behavior aimed at avoiding the draft and that avoidance behavior subsequently
affects civilian earnings, then a low lottery number ($Z_i$) could be related to unobserved
factors that determine civilian earnings ($u_i$); that is, $Z_i$ and $u_i$ are correlated
even though $Z_i$ is randomly assigned. As a second example, McClellan, McNeil, and
Newhouse’s (1994) study of the effect on heart attack patients of cardiac catheteriza- 
tion treated the relative distance to a catheterization hospital as if it were randomly 
assigned. But as the authors highlight and examine, if patients who live close to a 
catheterization hospital are healthier than those who live far away (perhaps because 
of better access to medical care generally), then the relative distance to a catheteriza- 
tion hospital would be correlated with omitted variables in the error term of the 
health outcome equation. In short, just because an instrument is randomly determined
or as-if randomly determined does not necessarily mean it is exogenous in the
sense that $\mathrm{corr}(Z_i, u_i) = 0$. Thus the case for exogeneity must be scrutinized closely
even if the instrument arises from a quasi-experiment.


Threats to External Validity 


Quasi-experimental studies use observational data, and the threats to the external 
validity of a study based on a quasi-experiment are generally similar to the threats 
discussed in Section 9.1 for conventional regression studies using observational data. 

One important consideration is that the special events that create the as-if ran- 
domness at the core of a quasi-experimental study can result in other special features 
that threaten external validity. For example, Card’s (1990) study of labor market 
effects of immigration discussed in Section 13.4 used the as-if randomness induced 
by the influx of Cuban immigrants in the Mariel boatlift. There were, however, spe- 
cial features of the Cuban immigrants, Miami, and its Cuban community that might 
make it difficult to generalize these findings to immigrants from other countries or 
to other destinations. Similarly, Angrist’s (1990) study of the labor market effects of 
serving in the U.S. military during the Vietnam War presumably would not generalize 
to peacetime military service. As usual, whether a study generalizes to a specific 
population and setting of interest depends on the details of the study and must be 
assessed on a case-by-case basis. 


Experimental and Quasi-Experimental 
Estimates in Heterogeneous Populations 


As discussed in Section 13.1, the causal effect can vary from one member of the 
population to the next. Section 13.1 discusses estimating causal effects that vary 
depending on observable variables, such as sex. In this section, we consider the con- 
sequences of unobserved variation in the causal effect. We refer to unobserved varia- 
tion in the causal effect as having a heterogeneous population. To keep things simple 
and to focus on the role of unobserved heterogeneity, in this section we omit control
variables W; the conclusions of this section carry over to regressions including control 
variables. 

If the population is heterogeneous, then the $i$th individual now has his or her own
causal effect, $\beta_{1i}$, which (in the terminology of Section 13.1) is the difference in the $i$th
individual’s potential outcomes if the treatment is or is not received. For example, $\beta_{1i}$
might be 0 for a resume-writing training program if the $i$th individual already knows how
to write a resume. With this notation, the population regression equation can be written


$$Y_i = \beta_0 + \beta_{1i} X_i + u_i. \tag{13.9}$$


Appendix 13.3 derives Equation (13.9) from the potential outcomes framework for
a heterogeneous population. Because $\beta_{1i}$ varies from one individual to the next in the
population and the individuals are selected from the population at random, $\beta_{1i}$ is a
random variable that, just like $u_i$, reflects unobserved variation across individuals (for
example, variation in preexisting resume-writing skills). The average causal effect is
the population mean value of the causal effect, $E(\beta_{1i})$; that is, it is the expected causal
effect of a randomly selected member of the population under study.

What do the estimators of Sections 13.1, 13.2, and 13.4 estimate if there is population
heterogeneity of the form in Equation (13.9)? We first consider the OLS estimator
when $X_i$ is as-if randomly determined; in this case, the OLS estimator is a
consistent estimator of the average causal effect. That is generally not true for the IV
estimator, however. Instead, if $X_i$ is partially influenced by $Z_i$, then the IV estimator
using the instrument Z estimates a weighted average of the causal effects, where
those for whom the instrument is most influential receive the most weight.


OLS with Heterogeneous Causal Effects 


If there is heterogeneity in the causal effect and if $X_i$ is randomly assigned, then the
differences estimator is a consistent estimator of the average causal effect. This result 
follows from the discussion in Section 13.1 and Appendix 13.3, which make use of the 
potential outcome framework; here it is shown without reference to potential out- 
comes by applying concepts from Chapters 3 and 4 directly to the random coeffi- 
cients regression model in Equation (13.9). 

The OLS estimator of $\beta_1$ in Equation (13.1) is $\hat{\beta}_1 = s_{XY}/s_X^2$ [Equation (4.5)]. If
the observations are i.i.d., then the sample covariance and variance are consistent
estimators of the population covariance and variance, so $\hat{\beta}_1 \xrightarrow{p} \sigma_{XY}/\sigma_X^2$. If $X_i$ is
randomly assigned, then $X_i$ is distributed independently of other individual characteristics,
both observed and unobserved, and in particular is distributed independently
of $\beta_{1i}$. Accordingly, the OLS estimator $\hat{\beta}_1$ has the limit


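$$\hat{\beta}_1 = \frac{s_{XY}}{s_X^2} \xrightarrow{p} \frac{\sigma_{XY}}{\sigma_X^2} = \frac{\mathrm{cov}(\beta_0 + \beta_{1i} X_i + u_i,\, X_i)}{\sigma_X^2} = \frac{\mathrm{cov}(\beta_{1i} X_i,\, X_i)}{\sigma_X^2} = E(\beta_{1i}), \tag{13.10}$$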
where the third equality uses the facts about covariances in Key Concept 2.3 and
$\mathrm{cov}(u_i, X_i) = 0$, which is implied by $E(u_i|X_i) = 0$ [Equation (2.28)], and where the
final equality follows from $\beta_{1i}$ being distributed independently of $X_i$, which it is if $X_i$
is randomly assigned (Exercise 13.9). Thus, if $X_i$ is randomly assigned, $\hat{\beta}_1$ is a consistent
estimator of the average causal effect $E(\beta_{1i})$.
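
The result in Equation (13.10) can be checked numerically. The sketch below assumes a population in which half the individuals have a causal effect of 0 (they already know how to write a resume) and the other half have a causal effect of 2, so the average causal effect is 1. These values, the intercept, and the sample size are hypothetical.

```python
import random
from statistics import mean

rng = random.Random(4)
n = 20000
BETA0 = 1.0                                   # hypothetical intercept

xs, ys, effects = [], [], []
for _ in range(n):
    # Individual causal effect beta_1i: 0 or 2 with equal probability (assumed).
    beta_1i = 0.0 if rng.random() < 0.5 else 2.0
    x = 1.0 if rng.random() < 0.5 else 0.0    # randomly assigned treatment X_i
    y = BETA0 + beta_1i * x + rng.gauss(0.0, 1.0)
    xs.append(x)
    ys.append(y)
    effects.append(beta_1i)

# OLS slope as the ratio of the sample covariance to the sample variance.
x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
s_xx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
beta1_hat = s_xy / s_xx

print(round(beta1_hat, 3), round(mean(effects), 3))
```

Both printed numbers should be close to 1 in this setup, even though no individual in the simulated population has a causal effect equal to 1.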


IV Regression with Heterogeneous Causal Effects 


Suppose that the causal effect is estimated by instrumental variables regression of $Y_i$
on $X_i$ (treatment actually received) using $Z_i$ (initial randomly or as-if randomly
assigned treatment) as an instrument. Suppose that $Z_i$ is a valid instrument (relevant
and exogenous) and that there is heterogeneity in the effect on $X_i$ of $Z_i$. Specifically,
suppose that $X_i$ is related to $Z_i$ by the linear model


$$X_i = \pi_0 + \pi_{1i} Z_i + v_i, \tag{13.11}$$


where the coefficient $\pi_{1i}$ varies from one individual to the next. Equation (13.11) is
the first-stage equation of TSLS with the modification that the effect on $X_i$ of a
change in $Z_i$ is allowed to vary from one individual to the next.

The TSLS estimator is $\hat{\beta}_1^{TSLS} = s_{ZY}/s_{ZX}$ [Equation (12.4)], the ratio of the sample
covariance between Z and Y to the sample covariance between Z and X. If the
observations are i.i.d., then these sample covariances are consistent estimators of the
population covariances, so $\hat{\beta}_1^{TSLS} \xrightarrow{p} \sigma_{ZY}/\sigma_{ZX}$. Suppose that the instrument $Z_i$ is
randomly assigned or as-if randomly assigned, so that $Z_i$ is distributed independently
of $(u_i, v_i, \pi_{1i}, \beta_{1i})$, and that $E(\pi_{1i}) \neq 0$ (instrument relevance). It is shown in
Appendix 13.2 that, under these assumptions,


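$$\hat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} \xrightarrow{p} \frac{\sigma_{ZY}}{\sigma_{ZX}} = \frac{E(\beta_{1i}\pi_{1i})}{E(\pi_{1i})}. \tag{13.12}$$

That is, the TSLS estimator converges in probability to the ratio of the expected value
of the product of $\beta_{1i}$ and $\pi_{1i}$ to the expected value of $\pi_{1i}$.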

The final ratio in Equation (13.12) is a weighted average of the individual causal
effects $\beta_{1i}$. The weights are $\pi_{1i}/E(\pi_{1i})$, which measure the relative degree to which
the instrument influences whether the $i$th individual receives treatment. Thus the
TSLS estimator is a consistent estimator of a weighted average of the individual
causal effects, where the individuals who receive the most weight are those for whom
the instrument is most influential. The weighted average causal effect that is estimated
by TSLS is called the local average treatment effect (LATE). The term local emphasizes
that it is the weighted average that places the most weight on those individuals
(more generally, entities) whose treatment probability is most influenced by the
instrumental variable.
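
The limit in Equation (13.12) can be outlined in a few lines by substituting Equations (13.9) and (13.11) into the two population covariances. This is only a sketch under the stated assumption that $Z_i$ is distributed independently of $(u_i, v_i, \pi_{1i}, \beta_{1i})$; the derivation the text relies on is the one in Appendix 13.2.

```latex
\begin{aligned}
\sigma_{ZX} &= \mathrm{cov}(Z_i,\ \pi_0 + \pi_{1i} Z_i + v_i)
             = E(\pi_{1i})\,\sigma_Z^2, \\
\sigma_{ZY} &= \mathrm{cov}(Z_i,\ \beta_0 + \beta_{1i} X_i + u_i)
             = \mathrm{cov}(Z_i,\ \beta_{1i}\pi_0 + \beta_{1i}\pi_{1i} Z_i + \beta_{1i} v_i)
             = E(\beta_{1i}\pi_{1i})\,\sigma_Z^2, \\
\frac{\sigma_{ZY}}{\sigma_{ZX}} &= \frac{E(\beta_{1i}\pi_{1i})}{E(\pi_{1i})}.
\end{aligned}
```

In each line, independence removes the covariance between $Z_i$ and every term that does not contain $Z_i$, and it lets the random coefficient multiplying $Z_i$ be replaced by its expectation.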

There are three special cases in which the LATE equals the average treatment effect: 


1. The treatment effect is the same for all individuals. This case corresponds to
$\beta_{1i} = \beta_1$ for all $i$. Then the final expression in Equation (13.12) simplifies to
$E(\beta_{1i}\pi_{1i})/E(\pi_{1i}) = \beta_1 E(\pi_{1i})/E(\pi_{1i}) = \beta_1$.

2. The instrument affects each individual equally. This case corresponds to
$\pi_{1i} = \pi_1$ for all $i$. In this case, the final expression in Equation (13.12) simplifies
to $E(\beta_{1i}\pi_{1i})/E(\pi_{1i}) = E(\beta_{1i})\pi_1/\pi_1 = E(\beta_{1i})$.

3. The heterogeneity in the treatment effect and heterogeneity in the effect of
the instrument are uncorrelated. This case corresponds to $\beta_{1i}$ and $\pi_{1i}$ being random
but $\mathrm{cov}(\beta_{1i}, \pi_{1i}) = 0$. Because $E(\beta_{1i}\pi_{1i}) = \mathrm{cov}(\beta_{1i}, \pi_{1i}) + E(\beta_{1i})E(\pi_{1i})$
[Equation (2.35)], if $\mathrm{cov}(\beta_{1i}, \pi_{1i}) = 0$, then $E(\beta_{1i}\pi_{1i}) = E(\beta_{1i})E(\pi_{1i})$, and
the final expression in Equation (13.12) simplifies to
$E(\beta_{1i}\pi_{1i})/E(\pi_{1i}) = E(\beta_{1i})E(\pi_{1i})/E(\pi_{1i}) = E(\beta_{1i})$.


In each of these three cases, there is population heterogeneity in the effect of the 
instrument, in the effect of the treatment, or in both, but the LATE equals the aver- 
age treatment effect. That is, in all three cases, TSLS is a consistent estimator of the 
average treatment effect. 

Aside from these three special cases, in general, the LATE differs from the average
treatment effect. For example, suppose that $Z_i$ has no influence on the treatment
decision for half the population (for them, $\pi_{1i} = 0$), while for the other half, $Z_i$ has a
common, nonzero influence on the treatment decision (for them, $\pi_{1i}$ takes on the
same nonzero value). Then TSLS is a consistent estimator of the average treatment
effect in the half of the population for which the instrument influences the treatment
decision. To be concrete, suppose workers are eligible for a job training program and
are randomly assigned a priority number Z, which influences how likely they are to
be admitted to the program. Half the workers know they will benefit from the program
and thus may decide to enroll in the program; for them, $\beta_{1i} = \beta_1^+ > 0$ and
$\pi_{1i} = \pi_1^+ > 0$. The other half know that, for them, the program is ineffective, so they
would not enroll even if admitted; that is, for them $\beta_{1i} = \beta_1^-$ and $\pi_{1i} = 0$. The average
treatment effect is $E(\beta_{1i}) = \frac{1}{2}(\beta_1^+ + \beta_1^-)$. The local average treatment effect is
$E(\beta_{1i}\pi_{1i})/E(\pi_{1i})$. Now $E(\pi_{1i}) = \frac{1}{2}\pi_1^+$ and
$E(\beta_{1i}\pi_{1i}) = \frac{1}{2}(\beta_1^- \times 0 + \beta_1^+\pi_1^+) = \frac{1}{2}\beta_1^+\pi_1^+$,
so $E(\beta_{1i}\pi_{1i})/E(\pi_{1i}) = \beta_1^+$. Thus in this example the LATE is the causal effect for
those workers who might enroll in the program, and it gives no weight to those who
will not enroll under any circumstances. In contrast, the average treatment effect
places equal weight on all individuals, regardless of whether they would enroll.
Because individuals decide to enroll based in part on their knowledge of how effective
the program will be for them, in this example the LATE exceeds the average
treatment effect.
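
The job training example can be simulated to see the gap between the LATE and the average treatment effect. The sketch below makes several simplifying assumptions that are not in the text: the priority number is reduced to a binary high or low priority, $\pi_0$ is set to 0, and the numerical values of $\beta_1^+$, $\beta_1^-$, and $\pi_1^+$ are hypothetical.

```python
import random
from statistics import mean


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)


rng = random.Random(5)
n = 40000
BETA_PLUS, BETA_MINUS = 2.0, 0.0     # hypothetical effects for the two halves
PI_PLUS = 0.6                        # hypothetical effect of Z on enrollment

zs, xs, ys, effects = [], [], [], []
for _ in range(n):
    would_enroll = rng.random() < 0.5            # which half of the population
    beta_1i = BETA_PLUS if would_enroll else BETA_MINUS
    pi_1i = PI_PLUS if would_enroll else 0.0
    z = 1.0 if rng.random() < 0.5 else 0.0       # randomly assigned priority
    # Enrollment probability is pi_1i * Z (pi_0 = 0 is assumed for simplicity).
    x = 1.0 if rng.random() < pi_1i * z else 0.0
    y = 1.0 + beta_1i * x + rng.gauss(0.0, 1.0)
    zs.append(z)
    xs.append(x)
    ys.append(y)
    effects.append(beta_1i)

tsls = cov(zs, ys) / cov(zs, xs)     # s_ZY / s_ZX, as in Equation (12.4)
ate = mean(effects)                  # sample analog of E(beta_1i)
print(round(tsls, 3), round(ate, 3))
```

Under these assumptions the TSLS estimate should be near $\beta_1^+ = 2$, while the average of the individual effects is near 1.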


Implications. If an individual’s decision to receive treatment depends on the effec- 
tiveness of the treatment for that individual, then the TSLS estimator, in general, is 
not a consistent estimator of the average causal effect. Instead, TSLS estimates a 
LATE, where the causal effects of the individuals who are most influenced by the 
instrument receive the greatest weight. 

This conclusion leads to a disconcerting situation in which two researchers, 
armed with different instrumental variables that are both valid in the sense that both 
are relevant and exogenous, would obtain different estimates of “the” causal effect,
even in large samples. The difference arises because each researcher is implicitly 
estimating a different weighted average of the individual causal effects in the popula- 
tion. In fact, a J-test of overidentifying restrictions can reject if the two instruments 
estimate different LATEs, even if both instruments are valid. Although both estima- 
tors provide some insight into the distribution of the causal effects via their respec- 
tive weighted averages of the form in Equation (13.12), in general, neither estimator 
is a consistent estimator of the average causal effect. 
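
The two-researcher situation can be illustrated with a simulation in which each instrument moves a different part of the population. Everything about the setup is an assumption for illustration: there are two equally sized types, type A responds only to the first instrument and has a causal effect of 2, and type B responds only to the second instrument and has a causal effect of 0.5. Both instruments are randomly assigned, so both are valid in the sense used above.

```python
import random
from statistics import mean


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)


rng = random.Random(6)
n = 60000

z1s, z2s, xs, ys, effects = [], [], [], [], []
for _ in range(n):
    type_a = rng.random() < 0.5
    beta_1i = 2.0 if type_a else 0.5           # hypothetical individual effects
    z1 = 1.0 if rng.random() < 0.5 else 0.0    # first researcher's instrument
    z2 = 1.0 if rng.random() < 0.5 else 0.0    # second researcher's instrument
    # Type A responds only to Z1, type B only to Z2 (assumed for illustration).
    p_treat = 0.5 * z1 if type_a else 0.5 * z2
    x = 1.0 if rng.random() < p_treat else 0.0
    y = 1.0 + beta_1i * x + rng.gauss(0.0, 1.0)
    z1s.append(z1)
    z2s.append(z2)
    xs.append(x)
    ys.append(y)
    effects.append(beta_1i)

tsls_z1 = cov(z1s, ys) / cov(z1s, xs)   # weights only type A
tsls_z2 = cov(z2s, ys) / cov(z2s, xs)   # weights only type B
ate = mean(effects)
print(round(tsls_z1, 3), round(tsls_z2, 3), round(ate, 3))
```

In this setup neither TSLS estimate is close to the average of the individual effects (1.25), and the two estimates differ from each other even in a large sample.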


Example: The cardiac catheterization study. Sections 12.5 and 13.4 discuss 
McClellan, McNeil, and Newhouse’s (1994) study of the effect on mortality of cardiac
catheterization of heart attack patients. The authors used instrumental variables 
regression, with the relative distance to a cardiac catheterization hospital as the 
instrumental variable. Based on their TSLS estimates, they found that cardiac cath- 
eterization had little or no effect on health outcomes. This result is surprising: Medi- 
cal procedures such as cardiac catheterization are subjected to rigorous clinical trials 
prior to approval for widespread use. Moreover, cardiac catheterization allows sur- 
geons to perform medical interventions that would have required major surgery a 
decade earlier, making these interventions safer and, presumably, better for long- 
term patient health. How could this econometric study fail to find beneficial effects 
of cardiac catheterization? 

One possible answer is that there is heterogeneity in the treatment effect of 
cardiac catheterization. For some patients, this procedure is an effective intervention, 
but for others, perhaps those who are healthier, it is less effective or, given the risks 
involved with any surgery, perhaps on the whole ineffective. Thus the average causal 
effect in the population of heart attack patients could be, and presumably is, positive. 
The IV estimator, however, measures a marginal effect, not an average effect, where 
the marginal effect is the effect of the procedure on those patients for whom relative 
distance to a cardiac catheterization hospital is an important factor in whether they 
receive treatment. But those patients could be just the relatively healthy patients for 
whom, on the margin, cardiac catheterization is a relatively ineffective procedure. If 
so, McClellan, McNeil, and Newhouse’s TSLS estimator measures the effect of the 
procedure for the marginal patient (for whom it is relatively ineffective), not for the 
average patient (for whom it might be effective). 


There are several good (but advanced) discussions of the effect of population heterogeneity on program 
evaluation estimators. They include the survey by Heckman, LaLonde, and Smith (1999, Section 7) and 
James Heckman’s lecture delivered when he received the Nobel Prize in Economics (Heckman, 2001, 
Section 7). The latter reference and Angrist, Graddy, and Imbens (2000) provide detailed discussion of the 
random effects model (which treats $\beta_{1i}$ as varying across individuals) and provide more general versions
of the result in Equation (13.12). The concept of the LATE was introduced by Imbens and Angrist (1994), 
who showed that, in general, it does not equal the average treatment effect. Imbens and Wooldridge 
(2009) provide an advanced survey of methods for program evaluation with treatment effect heterogene- 
ity, including those discussed in this chapter. 
Conclusion


In Chapter 1, we defined the causal effect in terms of the expected outcome of an 
ideal randomized controlled experiment. If a randomized controlled experiment is 
available or can be performed, it can provide compelling evidence on the causal 
effect under study, although even randomized controlled experiments are subject to 
potentially important threats to internal and external validity. 

Despite their advantages, randomized controlled experiments in economics face 
considerable hurdles, including ethical concerns and cost. The insights of experimental 
methods can, however, be applied to quasi-experiments, in which special circumstances 
make it seem as if randomization has occurred. In quasi-experiments, the causal effect 
can be estimated using a differences-in-differences estimator, possibly augmented with 
additional regressors; if the as-if randomization only partly influences the treatment, 
then instrumental variables regression can be used instead. An important advantage of 
quasi-experiments is that the source of the as-if randomness in the data is usually trans- 
parent and thus can be evaluated in a concrete way. An important threat confronting 
quasi-experiments is that sometimes the as-if randomization is not really random, so 
the treatment (or the instrumental variable) is correlated with omitted variables and 
the resulting estimator of the causal effect is biased. 

Quasi-experiments provide a bridge between observational data sets and true 
randomized controlled experiments. The econometric methods used in this chapter 
for analyzing quasi-experiments are familiar ones developed in different contexts in 
earlier chapters: OLS, panel data estimation methods, and instrumental variables 
regression. What differentiates quasi-experiments from the applications examined in 
Part II and the earlier chapters in Part III are the way in which these methods are 
interpreted and the data sets to which they are applied. Quasi-experiments provide 
econometricians with a way to think about how to acquire new data sets, how to think 
of instrumental variables, and how to evaluate the plausibility of the exogeneity
assumptions that underlie OLS and instrumental variables estimation.⁷


Summary 


1. The average causal effect in the population under study is the expected differ- 
ence in the average outcomes for the treatment and control groups in an ideal 
randomized controlled experiment. Actual experiments with human subjects 
deviate from an ideal experiment for various practical reasons, including the 
failure of people to comply with the experimental protocol. 


7Shadish, Cook, and Campbell (2002) provide a comprehensive treatment of experiments and quasi- 
experiments in the social sciences and in psychology. An important line of research in development eco- 
nomics focuses on experimental evaluations of health and education programs in developing countries. 
For examples, see Kremer, Miguel, and Thornton (2009) and the website of MIT’s Poverty Action Labo- 
ratory (http://www.povertyactionlab.org). Deaton (2010) provides a thoughtful critique of this research. 
2. If the actual treatment level $X_i$ is random, then the treatment effect can be estimated
by regressing the outcome on the treatment. If the assigned treatment $Z_i$
is random but the actual treatment $X_i$ is partly determined by individual choice,
then the causal effect can be estimated by instrumental variables regression,
using $Z_i$ as an instrument. If the treatment (or assigned treatment) is random,
conditional on some variables W, those control variables need to be included
in the regressions.

3. In a quasi-experiment, variations in laws or circumstances or accidents of
nature are treated as if they induce random assignment to treatment and control
groups. If the actual treatment is as-if random, then the causal effect can be
estimated by regression (possibly with additional pretreatment characteristics
as regressors); if the assigned treatment is as-if random, then the causal effect
can be estimated by instrumental variables regression.

4. Regression discontinuity estimators are based on quasi-experiments in which
treatment depends on whether an observable variable crosses a threshold
value.

5. A key threat to the internal validity of a quasi-experimental study is whether
the as-if randomization actually results in exogeneity. Because of behavioral
responses, the regression error may change in response to the treatment
induced by the quasi-experiment, so the treatment is not exogenous.

6. When the treatment effect varies from one individual to the next, the OLS
estimator is a consistent estimator of the average causal effect if the actual
treatment is randomly assigned or as-if randomly assigned. However, the
instrumental variables estimator is a weighted average of the individual treatment
effects, where the individuals for whom the instrument is most influential
receive the greatest weight.
13.1 A researcher studying the effects of a new fertilizer on crop yields plans
to carry out an experiment in which different amounts of the fertilizer are
applied to 100 different one-acre parcels of land. There will be four treatment
levels. Treatment level 1 is no fertilizer, treatment level 2 is 50% of the manufacturer’s
recommended amount of fertilizer, treatment level 3 is 100%, and
treatment level 4 is 150%. The researcher plans to apply treatment level 1 to
the first 25 parcels of land, treatment level 2 to the second 25 parcels, and so
forth. Can you suggest a better way to assign treatment levels? Why is your
proposal better than the researcher’s method?

13.2 A clinical trial is carried out for a new cholesterol-lowering drug. The drug
is given to 500 patients, and a placebo is given to another 500 patients, using
random assignment of the patients. How would you estimate the treatment
effect of the drug? Suppose you had data on the weight, age, and sex of each
patient. Could you use these data to improve your estimate? Explain. Suppose
you had data on the cholesterol level of each patient before he or she entered
the experiment. Could you use these data to improve your estimate? Explain.

13.3 Researchers studying the STAR data report anecdotal evidence that school
principals were pressured by some parents to place their children in the small
classes. Suppose some principals succumbed to this pressure and transferred
some children into the small classes. How would such transfers compromise
the internal validity of the study? Suppose you had data on the original random
assignment of each student before the principal’s intervention. How
could you use this information to restore the internal validity of the study?

13.4 What are experimental effects? How can such effects create bias in treatment
effects? What can a researcher do to reduce the bias?

13.5 Consider the quasi-experiment described in Section 13.4 involving the draft
lottery, military service, and civilian earnings. Explain why there might be
heterogeneous effects of military service on civilian earnings; that is, explain
why $\beta_{1i}$ in Equation (13.9) depends on $i$. Explain why there might be heterogeneous
effects of the lottery outcome on the probability of military service;
that is, explain why $\pi_{1i}$ in Equation (13.11) depends on $i$. If there are heterogeneous
responses of the sort you described, what behavioral parameter is being
estimated by the TSLS estimator?


Exercises 


13.1 How would you calculate the small class treatment effect from the results in
Table 13.1? Can you distinguish this treatment effect from the aide treatment
effect? How would you have to change the program to correctly estimate both
effects?


13.2 For the following calculations, use the results in column (3) of Table 13.2.
Consider two classrooms, A and B, which have identical values of the regressors
in column (3) of Table 13.2, except that:


a. Classroom A is a small class, and classroom B is a regular-sized class. 
Construct a 90% confidence interval for the expected difference in 
average test scores. 


b. Classroom A has a teacher with 6 years of experience, and classroom B 
has a teacher with 12 years of experience. Construct a 95% confidence 
interval for the expected difference in average test scores. 


c. Classroom A is a small-sized class with a teacher with 6 years of expe- 
rience, and classroom B is a regular-sized class with a teacher with 
12 years of experience. Construct a 95% confidence interval for the 
expected difference in average test scores. (Hint: In STAR, the teachers 
were randomly assigned to the different types of classrooms.) 


d. Why is the intercept missing from column (4)? 


13.3 Suppose that, in a randomized controlled experiment of the effect of an SAT
preparatory course on SAT scores, the following results are reported:


                                             Treatment Group   Control Group
Average SAT score ($\bar{X}$)                     1348              1395
Standard deviation of SAT score ($s_X$)           873               82.1
Number of men                                     60                40
Number of women                                   40                60


a. Estimate the average treatment effect on test scores. 


b. Is there evidence of nonrandom assignment? Explain. 


13.4 A new law will increase minimum wages in City A next year but not in City B,
a city much like City A. You collect employment data from a randomly selected
sample of restaurants in cities A and B this year, and you plan to return and
collect data at restaurants next year. Let $Y_{it}$ denote the employment level at
restaurant i in year t.


a. Suppose you design your analysis so you sample the same restaurants this 
year and next year. Explain how you will use the data to estimate the aver- 
age causal effect of the minimum wage increase on restaurant employment. 


b. Suppose you design your analysis so you sample different, independently 
selected restaurants this year and next year. Explain how you will use 
the data to estimate the average causal effect of the minimum wage 
increase on restaurant employment. 


c. Which sampling design, using the same restaurants in (a) or using different 
restaurants in (b), is likely to yield a more precise estimate of the average 
causal effect? (Hint: You might find it useful to solve Exercise 13.6 first.) 


13.5 Consider a study to evaluate the effect on college student grades of dorm
room Internet connections. In a large dorm, half the rooms are randomly 
wired for high-speed Internet connections (the treatment group), and final 
course grades are collected for all residents. Which of the following pose 
threats to internal validity, and why? 


a. Midway through the year all the male athletes move into a fraternity and 
drop out of the study. (Their final grades are not observed.) 


b. Engineering students assigned to the control group put together a local 
area network so that they can share a private wireless Internet connec- 
tion that they pay for jointly. 


c. The art majors in the treatment group never learn how to access their 
Internet accounts. 


d. The economics majors in the treatment group provide access to their 
Internet connection to those in the control group, for a fee. 


13.6 Suppose there are panel data for T = 2 time periods for a randomized
controlled experiment, where the first observation (t = 1) is taken before the
experiment and the second observation (t = 2) is for the posttreatment
period. Suppose the treatment is binary; that is, suppose $X_{it} = 1$ if the $i$th
individual is in the treatment group and t = 2, and $X_{it} = 0$ otherwise. Further
suppose the treatment effect can be modeled using the specification


$$Y_{it} = \alpha_i + \beta_1 X_{it} + u_{it},$$


where $\alpha_i$ are individual-specific effects with a mean of 0 and a variance of $\sigma_\alpha^2$
and $u_{it}$ is an error term, where $u_{it}$ is homoskedastic, $\mathrm{cov}(u_{i1}, u_{i2}) = 0$, and
$\mathrm{cov}(u_{it}, \alpha_i) = 0$ for all i. Let $\hat{\beta}_1^{\text{differences}}$ denote the differences estimator
(that is, the OLS estimator in a regression of $Y_{i2}$ on $X_{i2}$ with an intercept),
and let $\hat{\beta}_1^{\text{diffs-in-diffs}}$ denote the differences-in-differences estimator (that is,
the estimator of $\beta_1$ based on the OLS regression of $\Delta Y_i = Y_{i2} - Y_{i1}$ against
$\Delta X_i = X_{i2} - X_{i1}$ and an intercept).


a. Show that $n\,\mathrm{var}(\hat{\beta}_1^{\text{differences}}) \to (\sigma_u^2 + \sigma_\alpha^2)/\mathrm{var}(X_{i2})$. (Hint: Use the
homoskedasticity-only formulas for the variance of the OLS estimator in
Appendix 5.1.)


b. Show that $n\,\mathrm{var}(\hat{\beta}_1^{\text{diffs-in-diffs}}) \to 2\sigma_u^2/\mathrm{var}(X_{i2})$. (Hint: Note that
$X_{i2} - X_{i1} = X_{i2}$. Why?)

c. Based on your answers to (a) and (b), when would you prefer the 
differences-in-differences estimator over the differences estimator, based 
purely on efficiency considerations? 


13.7 Suppose you have panel data from an experiment with T = 2 periods (so
t = 1, 2). Consider the panel data regression model with fixed individual and
time effects and individual characteristics $W_i$ that do not change over time. Let
the treatment be binary, so that $X_{it} = 1$ for t = 2 for the individuals in the
treatment group and $X_{it} = 0$ otherwise. Consider the population regression model


$$Y_{it} = \alpha_i + \beta_1 X_{it} + \beta_2 (D_t \times W_i) + \beta_0 D_t + v_{it},$$


where $\alpha_i$ are individual fixed effects, $D_t$ is the binary variable that equals 1
if t = 2 and equals 0 if t = 1, $D_t \times W_i$ is the product of $D_t$ and $W_i$, and the
$\alpha$'s and $\beta$'s are unknown coefficients. Let $\Delta Y_i = Y_{i2} - Y_{i1}$. Derive Equation
(13.6) (in the case of a single W regressor, so r = 1) from this population
regression model.


13.8 Suppose you have the same data as in Exercise 13.7 (panel data with two
periods, n observations), but ignore the W regressor. Consider the alternative
regression model


$$Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_2 G_i + \beta_3 D_t + u_{it},$$


where $G_i = 1$ if the individual is in the treatment group and $G_i = 0$ if the
individual is in the control group. Show that the OLS estimator of $\beta_1$ is the
differences-in-differences estimator in Equation (13.4). (Hint: See Section 8.3.)


13.9 Derive the final equality in Equation (13.10). (Hint: Use the definition of the
covariance, and remember that, because the actual treatment $X_i$ is random, $\beta_{1i}$
and $X_i$ are independently distributed.)


13.10 Consider the regression model with heterogeneous regression coefficients

$$Y_i = \beta_{0i} + \beta_{1i} X_i + v_i,$$

where $(v_i, X_i, \beta_{0i}, \beta_{1i})$ are i.i.d. random variables with $\beta_0 = E(\beta_{0i})$ and
$\beta_1 = E(\beta_{1i})$.


a. Show that the model can be written as $Y_i = \beta_0 + \beta_1 X_i + u_i$, where
$u_i = (\beta_{0i} - \beta_0) + (\beta_{1i} - \beta_1) X_i + v_i$.

b. Suppose $X_i$ is randomly assigned, so that $E[\beta_{0i}|X_i] = \beta_0$, $E[\beta_{1i}|X_i] = \beta_1$, and
$E[v_i|X_i] = 0$. Show that $E[u_i|X_i] = 0$.

c. Show that assumption 1 and assumption 2 of Key Concept 4.3 are satisfied.


d. Suppose outliers are rare, so that $(u_i, X_i)$ have finite fourth moments. Is
it appropriate to use OLS and the methods of Chapters 4 and 5 to estimate
and carry out inference about the average values of $\beta_{0i}$ and $\beta_{1i}$?


e. Now suppose $X_i$ is not randomly assigned, that $E[v_i|X_i] = 0$, but that
$\beta_{1i}$ and $X_i$ are positively correlated, so that observations with
larger-than-average values of $X_i$ tend to have larger-than-average values of
$\beta_{1i}$. Are the assumptions in Key Concept 4.3 satisfied? If not, which
assumption(s) is (are) violated? Will the OLS estimator of $\beta_1$ be
unbiased for $E(\beta_{1i})$?


13.11 Results of a study by McClellan, McNeil, and Newhouse are reported in
Chapter 12. They estimate the effect of cardiac catheterization on patient sur- 
vival times. They instrument the use of cardiac catheterization by the distance 
between a patient’s home and a hospital that offers the treatment. Do you 
think the local average treatment effect differs from the average treatment effect? 


13.12 Consider the potential outcomes framework from Appendix 13.3. Suppose $X_i$
is a binary treatment that is independent of the potential outcomes $Y_i(1)$ and
$Y_i(0)$. Let $TE_i = Y_i(1) - Y_i(0)$ denote the treatment effect for individual i.


a. Can you consistently estimate $E[Y_i(1)]$ and $E[Y_i(0)]$? If yes, explain
how; if not, explain why not.

b. Can you consistently estimate $E(TE_i)$? If yes, explain how; if not,
explain why not.


c. Can you consistently estimate $\mathrm{var}[Y_i(1)]$ and $\mathrm{var}[Y_i(0)]$? If yes, explain
how; if not, explain why not.

d. Can you consistently estimate $\mathrm{var}(TE_i)$? If yes, explain how; if not,
explain why not.


e. Do you think you can consistently estimate the median treatment effect 
in the population? Explain. 


Empirical Exercises 


E13.1 A prospective employer receives two resumes: a resume from a white job appli- 
cant and a similar resume from an African American applicant. Is the employer 
more likely to call back the white applicant to arrange an interview? Mari- 
anne Bertrand and Sendhil Mullainathan carried out a randomized controlled 
experiment to answer this question. Because race is not typically included on 
a resume, they differentiated resumes on the basis of “white-sounding names” 
(such as Emily Walsh or Gregory Baker) and “African American-sounding
names” (such as Lakisha Washington or Jamal Jones). A large collection of fic- 
titious resumes was created, and the presupposed “race” (based on the “sound” 
of the name) was randomly assigned to each resume. These resumes were sent 
to prospective employers to see which resumes generated a phone call (a call- 
back) from the prospective employer. Data from the experiment and a detailed 
data description are on the text website, http://www.pearsonglobaleditions.com,
in the files Names and Names_Description.¹


a. Define the callback rate as the fraction of resumes that generate a phone 
call from the prospective employer. What was the callback rate for 
whites? For African Americans? Construct a 95% confidence interval 
for the difference in the callback rates. Is the difference statistically sig- 
nificant? Is it large in a real-world sense? 


b. Is the African American/white callback rate differential different for 
men than for women? 


c. What is the difference in callback rates for high-quality versus low-
quality resumes? What is the high-quality/low-quality difference for 
white applicants? For African American applicants? Is there a significant 
difference in this high-quality/low-quality difference for whites versus 
African Americans? 


d. The authors of the study claim that race was assigned randomly to the 
resumes. Is there any evidence of nonrandom assignment? 


The Project STAR Data Set 


The Project STAR public access data set contains data on test scores, treatment groups, and 
student and teacher characteristics for the 4 years of the experiment, from academic year 
1985-1986 to academic year 1988-1989. The test score data analyzed in this chapter are the 
sum of the scores on the math and reading portions of the Stanford Achievement Test. The 
binary variable “Boy” in Table 13.2 indicates whether the student is a boy (=1) or girl (=0); 
the binary variables “Black” and “Race other than black or white” indicate the student’s race. 
The binary variable “Free lunch eligible” indicates whether the student is eligible for a free 
lunch during that school year. The “Teacher’s years of experience” is the total years of experi- 
ence of the teacher whom the student had in the grade for which the test data apply. The data 
set also indicates which school the student attended in a given year, making it possible to
construct binary school-specific indicator variables.


¹These data were provided by Professor Marianne Bertrand of the University of Chicago and were used in
her paper with Sendhil Mullainathan, “Are Emily and Greg More Employable than Lakisha and Jamal? A 
Field Experiment on Labor Market Discrimination,” American Economic Review, 2004, 94(4): 991-1013. 
IV Estimation When the Causal Effect Varies
Across Individuals 


This appendix derives the probability limit of the TSLS estimator in Equation (13.12) when 
there is population heterogeneity in the treatment effect and in the influence of the instrument 
on the receipt of treatment. Specifically, we assume that the IV regression assumptions in 
Key Concept 12.4 hold except that treatment effects are heterogeneous, as in Equations (13.9) 
and (13.11). Further assume that $Z_i$ is randomly assigned or as-if randomly assigned, so
$(u_i, v_i, \pi_{1i}, \beta_{1i})$ are distributed independently of $Z_i$; also assume that $E(\pi_{1i}) \neq 0$ (so the
instrument is relevant on average).

Because $(X_i, Y_i, Z_i)$, $i = 1, \ldots, n$, are i.i.d. with four moments, the law of large numbers in
Key Concept 2.6 applies and

$$\hat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} \xrightarrow{p} \frac{\sigma_{ZY}}{\sigma_{ZX}}. \tag{13.13}$$

(See Appendix 3.3 and Exercise 18.2.) The task thus is to obtain expressions for $\sigma_{ZY}$ and $\sigma_{ZX}$ in
terms of the moments of $\pi_{1i}$ and $\beta_{1i}$. Now $\sigma_{ZX} = E[(Z_i - \mu_Z)(X_i - \mu_X)] = E[(Z_i - \mu_Z)X_i]$.


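Substituting Equation (13.11) into this expression for $\sigma_{ZX}$ yields

$$
\begin{aligned}
\sigma_{ZX} &= E[(Z_i - \mu_Z)(\pi_0 + \pi_{1i} Z_i + v_i)] \\
&= \pi_0 E(Z_i - \mu_Z) + E[\pi_{1i} Z_i (Z_i - \mu_Z)] + \mathrm{cov}(Z_i, v_i) \\
&= \sigma_Z^2 E(\pi_{1i}),
\end{aligned}
\tag{13.14}
$$

where the third equality follows because $E(Z_i - \mu_Z) = 0$; because $Z_i$ and $v_i$ are independent,
so that $\mathrm{cov}(Z_i, v_i) = 0$; and because $\pi_{1i}$ and $Z_i$ are independent, so that
$E[\pi_{1i} Z_i (Z_i - \mu_Z)] = E(\pi_{1i}) E[Z_i (Z_i - \mu_Z)] = \sigma_Z^2 E(\pi_{1i})$.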

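Next consider $\sigma_{ZY}$. Substituting Equation (13.11) into Equation (13.9) yields
$Y_i = \beta_0 + \beta_{1i}(\pi_0 + \pi_{1i} Z_i + v_i) + u_i$, so

$$
\begin{aligned}
\sigma_{ZY} &= E[(Z_i - \mu_Z) Y_i] \\
&= E[(Z_i - \mu_Z)(\beta_0 + \beta_{1i}\pi_0 + \beta_{1i}\pi_{1i} Z_i + \beta_{1i} v_i + u_i)] \\
&= \beta_0 E(Z_i - \mu_Z) + \pi_0 E[\beta_{1i}(Z_i - \mu_Z)] + E[\beta_{1i}\pi_{1i} Z_i (Z_i - \mu_Z)] \\
&\quad + E[\beta_{1i} v_i (Z_i - \mu_Z)] + \mathrm{cov}(Z_i, u_i).
\end{aligned}
\tag{13.15}
$$

The assumption that $(u_i, v_i, \beta_{1i}, \pi_{1i})$ is independent of $Z_i$, along with the fact that $E(Z_i - \mu_Z) = 0$,
implies the following simplifications for the five terms after the final equality in Equation (13.15):
$\beta_0 E(Z_i - \mu_Z) = 0$; $\pi_0 E[\beta_{1i}(Z_i - \mu_Z)] = \pi_0 E(\beta_{1i}) E(Z_i - \mu_Z) = 0$;
$E[\beta_{1i}\pi_{1i} Z_i (Z_i - \mu_Z)] = E(\beta_{1i}\pi_{1i}) E[Z_i (Z_i - \mu_Z)] = E(\beta_{1i}\pi_{1i}) \sigma_Z^2$;
$E[\beta_{1i} v_i (Z_i - \mu_Z)] = E(\beta_{1i} v_i) E(Z_i - \mu_Z) = 0$;
and $\mathrm{cov}(Z_i, u_i) = 0$. Thus the final expression in Equation (13.15) simplifies to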


$$\sigma_{ZY} = \sigma_Z^2 E(\beta_{1i}\pi_{1i}). \tag{13.16}$$

Substituting Equations (13.14) and (13.16) into Equation (13.13) yields
$\hat{\beta}_1^{TSLS} \xrightarrow{p} \sigma_Z^2 E(\beta_{1i}\pi_{1i}) / [\sigma_Z^2 E(\pi_{1i})] = E(\beta_{1i}\pi_{1i})/E(\pi_{1i})$, which is the result stated in
Equation (13.12).
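
The limit derived above can be checked numerically. The following is a minimal simulation sketch, not part of the original text: the function name `simulate_iv_heterogeneous`, the distributions, and all parameter values are illustrative assumptions. It draws $(u_i, v_i, \pi_{1i}, \beta_{1i})$ independently of a randomly assigned binary instrument, computes the TSLS estimator as $s_{ZY}/s_{ZX}$, and compares it with $E(\beta_{1i}\pi_{1i})/E(\pi_{1i})$ and with the average treatment effect $E(\beta_{1i})$.

```python
import numpy as np


def simulate_iv_heterogeneous(n=200_000, seed=0):
    """Compare TSLS with E(beta1*pi1)/E(pi1) and with E(beta1).

    All distributions and constants here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Randomly assigned binary instrument Z_i.
    z = rng.integers(0, 2, size=n).astype(float)

    # Heterogeneous coefficients, drawn independently of Z_i.
    # beta1 is positively related to pi1, so the weighted average differs
    # from the simple average.
    pi1 = rng.uniform(0.2, 1.0, size=n)
    beta1 = 1.0 + 2.0 * pi1 + rng.normal(0.0, 0.5, size=n)

    # Errors: u is correlated with v, so X is endogenous.
    v = rng.normal(size=n)
    u = 0.5 * v + rng.normal(size=n)

    x = 0.3 + pi1 * z + v      # first stage, as in Equation (13.11)
    y = 2.0 + beta1 * x + u    # outcome equation, as in Equation (13.9)

    # TSLS with a single instrument is the ratio of sample covariances.
    s_zy = np.cov(z, y)[0, 1]
    s_zx = np.cov(z, x)[0, 1]
    tsls = s_zy / s_zx

    weighted_effect = np.mean(beta1 * pi1) / np.mean(pi1)
    average_effect = np.mean(beta1)
    return tsls, weighted_effect, average_effect


if __name__ == "__main__":
    tsls, weighted_effect, average_effect = simulate_iv_heterogeneous()
    print(f"TSLS estimate:            {tsls:.3f}")
    print(f"E(beta1*pi1)/E(pi1):      {weighted_effect:.3f}")
    print(f"Average treatment effect: {average_effect:.3f}")
```

In this design, individuals whose treatment responds more strongly to the instrument also have larger treatment effects, so the TSLS estimate should be close to the weighted average and noticeably above the simple average effect.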


The Potential Outcomes Framework 
for Analyzing Data from Experiments 


This appendix provides a mathematical treatment of the potential outcomes framework dis- 
cussed in Section 13.1. The potential outcomes framework, combined with a constant treat- 
ment effect, implies the regression model in Equation (13.1). If assignment is random, 
conditional on covariates, the potential outcomes framework leads to Equation (13.2) and 
conditional mean independence. We consider a binary treatment with $X_i = 1$ indicating
receipt of treatment. 

Let $Y_i(1)$ denote individual i's potential outcome if treatment is received, and let $Y_i(0)$
denote the potential outcome if treatment is not received, so individual i's treatment effect is
$Y_i(1) - Y_i(0)$. The average treatment effect in the population is $E[Y_i(1) - Y_i(0)]$. Because
the individual is either treated or not, only one of the two potential outcomes is observed. The
observed outcome, $Y_i$, is related to the potential outcomes by

$$Y_i = Y_i(1) X_i + Y_i(0)(1 - X_i). \tag{13.17}$$


If some individuals receive the treatment and some do not, the expected difference
in observed outcomes between the two groups is $E(Y_i|X_i = 1) - E(Y_i|X_i = 0) =
E[Y_i(1)|X_i = 1] - E[Y_i(0)|X_i = 0]$. This is true no matter how treatment is determined
and simply says that the expected difference is the mean treatment outcome for the treated
minus the mean no-treatment outcome for the untreated.

If the individuals are randomly assigned to the treatment and control groups, then $X_i$ is
distributed independently of all personal attributes and in particular is independent of
$[Y_i(1), Y_i(0)]$. With random assignment, the mean difference between the treatment and
control groups is

$$
\begin{aligned}
E(Y_i|X_i = 1) - E(Y_i|X_i = 0) &= E[Y_i(1)|X_i = 1] - E[Y_i(0)|X_i = 0] \\
&= E[Y_i(1)] - E[Y_i(0)] = E[Y_i(1) - Y_i(0)],
\end{aligned}
\tag{13.18}
$$

where the second equality uses the fact that $[Y_i(1), Y_i(0)]$ are independent of $X_i$ by random
assignment and the third equality uses the linearity of expectations [Equation (2.29)]. Thus if
$X_i$ is randomly assigned, the mean difference in the experimental outcomes between the two
groups is the average treatment effect in the population from which the subjects were drawn.
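
As a rough illustration of Equations (13.17) and (13.18), the sketch below simulates potential outcomes and compares the difference in group means under random assignment with the same difference when individuals select into treatment. The function name, the "ability" variable, and every numerical value are hypothetical choices made for this example only.

```python
import numpy as np


def simulate_random_assignment(n=100_000, seed=0):
    """Difference in means under random assignment vs. self-selection.

    The data-generating process is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)

    # A personal attribute that affects both potential outcomes.
    ability = rng.normal(size=n)

    # Potential outcomes Y_i(0) and Y_i(1); the treatment effect varies
    # across individuals and has a population mean of 1.0.
    y0 = 10.0 + 2.0 * ability + rng.normal(size=n)
    y1 = y0 + 1.0 + 0.5 * ability
    average_treatment_effect = np.mean(y1 - y0)

    # Case 1: treatment is randomly assigned, independent of (Y(1), Y(0)).
    x_random = rng.integers(0, 2, size=n)

    # Case 2: high-ability individuals are more likely to take treatment.
    x_selected = (ability + rng.normal(size=n) > 0).astype(int)

    def difference_in_means(x):
        # Observed outcome, Equation (13.17).
        y = y1 * x + y0 * (1 - x)
        return y[x == 1].mean() - y[x == 0].mean()

    return (
        average_treatment_effect,
        difference_in_means(x_random),
        difference_in_means(x_selected),
    )


if __name__ == "__main__":
    ate, diff_random, diff_selected = simulate_random_assignment()
    print(f"Average treatment effect:           {ate:.3f}")
    print(f"Difference in means (random):       {diff_random:.3f}")
    print(f"Difference in means (self-selected): {diff_selected:.3f}")
```

Under random assignment the difference in means should be close to the average treatment effect, whereas under self-selection it mixes the treatment effect with the difference in ability between the two groups.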
The potential outcome framework translates directly into the regression notation used
throughout this text. Let $u_i = Y_i(0) - E[Y_i(0)]$, and denote $E[Y_i(0)] = \beta_0$. Also denote
$Y_i(1) - Y_i(0) = \beta_{1i}$, so that $\beta_{1i}$ is the treatment effect for individual i. Starting with Equation
(13.17), we have

$$
\begin{aligned}
Y_i &= Y_i(1) X_i + Y_i(0)(1 - X_i) \\
&= Y_i(0) + [Y_i(1) - Y_i(0)] X_i \\
&= E[Y_i(0)] + [Y_i(1) - Y_i(0)] X_i + \{Y_i(0) - E[Y_i(0)]\} \\
&= \beta_0 + \beta_{1i} X_i + u_i.
\end{aligned}
\tag{13.19}
$$

Thus, starting with the relationship between observed and potential outcomes in Equation (13.17)
and simply changing notation, we obtain the random coefficients regression model in
Equation (13.9). If $X_i$ is randomly assigned, then $X_i$ is independent of $[Y_i(1), Y_i(0)]$ and thus is
independent of $\beta_{1i}$ and $u_i$. If the treatment effect is constant, then $\beta_{1i} = \beta_1$ and Equation (13.9)
becomes Equation (13.1). If the outcome $Y_i$ is measured with error, then the first line of Equation
(13.19) would include a measurement error term, which would be subsumed in $u_i$ in the final line.

As discussed in Section 13.1, in some designs $X_i$ is randomly assigned based on the value
of a third variable, $W_i$. If $W_i$ and the potential outcomes are not independent, then, in general,
the mean difference between groups does not equal the average treatment effect; that is,
Equation (13.18) does not hold. However, random assignment of $X_i$ given $W_i$ implies that,
conditional on $W_i$, $X_i$ and $[Y_i(1), Y_i(0)]$ are independent. This condition (that $[Y_i(1), Y_i(0)]$ is
independent of $X_i$, conditional on $W_i$) is sometimes called unconfoundedness.

If the treatment effect does not vary across individuals and if $E(Y_i|X_i, W_i)$ is linear, then
unconfoundedness implies conditional mean independence of the regression error in
Equation (13.2). It follows from Appendix 6.5 that, under these conditions, the OLS estimator of $\beta_1$
in Equation (13.2) is unbiased, although, in general, the OLS estimator of $\gamma$ is biased because
$E(u_i|W_i) \neq 0$. To show conditional mean independence under these conditions, let
$Y_i(0) = \beta_0 + \gamma W_i + u_i$, where $\gamma$ is the causal effect (if any) on $Y_i(0)$ of $W_i$, and let
$Y_i(1) - Y_i(0) = \beta_1$ (constant treatment effect). Then the logic leading to Equation (13.19)
yields $Y_i = \beta_0 + \beta_1 X_i + \gamma W_i + u_i$, which is Equation (13.2). Thus $E(u_i|X_i, W_i) =
E[Y_i(0) - \beta_0 - \gamma W_i | X_i, W_i] = E[Y_i(0) - \beta_0 - \gamma W_i | W_i] = E(u_i|W_i)$, where the second
equality follows from unconfoundedness, which implies that $E[Y_i(0)|X_i, W_i] = E[Y_i(0)|W_i]$.
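
The argument above can be illustrated with a small simulation, offered here as a sketch under stated assumptions rather than as a result from the text. Treatment is assigned at random with a probability that depends only on a binary $W_i$, the treatment effect is constant, and the error $u_i$ is correlated with $W_i$. The function name and all parameter values are hypothetical.

```python
import numpy as np


def simulate_conditional_assignment(n=100_000, seed=0):
    """OLS with a control variable when assignment is random given W.

    All constants below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    beta0, beta1, gamma = 1.0, 2.0, 3.0

    # Binary covariate W_i; the treatment probability depends only on W_i.
    w = rng.integers(0, 2, size=n).astype(float)
    p_treat = np.where(w == 1.0, 0.8, 0.2)
    x = (rng.uniform(size=n) < p_treat).astype(float)

    # The error is correlated with W, so E(u|W) != 0, but given W it does
    # not depend on X (conditional mean independence).
    u = 1.5 * (w - 0.5) + rng.normal(size=n)
    y = beta0 + beta1 * x + gamma * w + u

    # Simple difference in means ignores W and is confounded.
    naive = y[x == 1.0].mean() - y[x == 0.0].mean()

    # OLS of Y on a constant, X, and W, as in Equation (13.2).
    design = np.column_stack([np.ones(n), x, w])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return naive, coef[1], coef[2]


if __name__ == "__main__":
    naive, b1_hat, gamma_hat = simulate_conditional_assignment()
    print(f"Difference in means (no control): {naive:.3f}")
    print(f"OLS coefficient on X (true 2.0):  {b1_hat:.3f}")
    print(f"OLS coefficient on W (causal 3.0): {gamma_hat:.3f}")
```

In this design the coefficient on $X_i$ should be close to the assumed treatment effect of 2.0, while the coefficient on $W_i$ picks up both the causal effect of $W_i$ and its correlation with $u_i$, so it does not estimate $\gamma$.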
Chapter 14  Prediction with Many Regressors and Big Data


Chapter 4 began with two different questions about student performance at
schools. A superintendent wanted to know whether test scores would
improve if she reduced the student-teacher ratio in her schools—and if they would, by 
how much. A father, trying to decide where to live, wanted to predict which schools 
had the highest-performing students. Answering the superintendent's question 
requires you to estimate the causal effect on test scores of the student-teacher ratio, 
and estimating causal effects is the focus of Chapters 4-13. In contrast, answering the 
father’s question requires you to predict school test scores given one or more relevant 
variables—in Chapter 4, the student-teacher ratio, extended in Chapter 6 to include 
additional information on school and community characteristics. 

Statistical prediction entails using data to estimate a prediction model and then 
applying that model to new, out-of-sample observations. The goal is accurate out-of- 
sample prediction. In a prediction problem, there are neither specific regressors of 
interest nor control variables; there are only predictors and the variable to be predicted. 

If there are only a handful of predictors, ordinary least squares (OLS) works well if 
the least squares assumptions for prediction in Appendix 6.4 hold. But modern data 
sets often have many predictors. For example, the empirical application in this chapter 
is the prediction of school-level test scores using data on school and community 
characteristics. We use data on 3932 elementary schools in California; half of these 
observations are used to estimate prediction models, while the other half are reserved 
to test their performance.¹ For most of the chapter, we consider a data set with 817
predictors, which is expanded in Section 14.6 to 2065 predictors. This problem of 
predicting school test scores is typical of many prediction applications using cross- 
sectional data, such as forecasting sales for a business, predicting patient-level out- 
comes of medical procedures, or predicting demand for services by state and local 
government. In such applications, the number of predictors can be nearly as large as, 
or even larger than, the number of observations. 

With so many predictors, OLS overfits the data and makes poor out-of-sample 
predictions. Fortunately, it is possible to improve upon OLS by using estimators that 
are broadly referred to as shrinkage estimators. These estimators are biased (they 
“shrink” the estimator), and the coefficients, in general, do not have a causal interpre- 
tation. Remarkably, however, when there are many predictors, introducing bias can 
reduce the variance of the estimator sufficiently that the overall out-of-sample predic- 
tion accuracy is improved. 


¹In California, a school district typically contains multiple individual schools. The test score data set used
in Chapters 4-9 contains district-level data, while the data used here are for individual schools. 
This chapter considers prediction using cross-sectional data sampled from a larger
population (shoppers, patients, schools) to predict outcomes for members of the pop- 
ulation not in the estimation sample. A related problem is prediction of future events, 
such as the number of jobs the economy will add next month. Predictions about the 
future are typically referred to as forecasts, and we adopt that terminology. Forecasting 
uses time series data, which introduce additional notation and technicalities. Forecast- 
ing is taken up in Part IV. 

The availability of many predictors is one of the opportunities provided by very 
large data sets. The field of analyzing big data sets goes by multiple names, including 
machine learning, data science, and the term we shall use, big data. 


What Is “Big Data”? 


Data sets can be big in the sense of having many observations, or having many 
predictors relative to the number of observations, or both. Big data sets can be 
nonstandard—for example, containing text or images. 

Big data sets make available new families of applications. One such family, which 
is the focus of this chapter, is prediction when the number of predictors k is large 
compared to the number of observations n. The prediction methods considered in 
this chapter start with linear regression, so having many predictors corresponds to 
having many regressors. This situation can arise if one has many distinct primitive 
predictors, or it can arise if one is considering predictions that are nonlinear functions 
of the primitive predictors. Even if one starts with only a few dozen primitive predic- 
tors, including squares, cubes, and interactions very quickly expands the number of 
regressors into the hundreds or thousands. 

A second family of applications that arises with big data is categorization. We 
have encountered this problem before, in the context of regression with a binary 
dependent variable. The logit and probit models of Chapter 11 predict the probability 
that the dependent variable is 1—in the empirical application, the probability that a 
loan application is denied. An alternative framing of this problem is to divide the 
data set into two groups, or categories: those applications that are likely to be denied 
and those that are likely to be accepted. From a prediction perspective, the aim is to 
develop a model of loan applications that mimics the decision-making process of a 
loan officer. Said differently, by fitting that model, a machine (computer) would have 
learned (estimated) the decision process made by a loan officer. Using that machine 
learning model, the computer then can make the accept/deny decision itself for 
future applications. Indeed, the online home loan application industry relies heavily 
on machine learning, applied to very large data sets on loan applications, to assess 
the eligibility of an applicant for a mortgage. 

A third family of applications concerns testing multiple hypotheses. In the 
regression context, for example, there might be a potentially large set of coefficients 
representing different treatments, and the econometrician might be interested in 
ascertaining which, if any, of these treatments is effective. Because the F-statistic tests
a joint hypothesis on a group of coefficients, it is not well suited for the problem of 
testing many treatments to find out which of the treatments is effective. Testing many 
individual hypotheses with the aim of determining which treatment effect is nonzero 
requires specialized methods that have been developed for big data applications. 

A fourth family of applications concerns handling nonstandard data, such as text 
and images. The key step is turning these nonstandard data into numerical data, 
which can then be handled using techniques for high-dimensional data sets. 
Section 14.7 discusses methods for handling text data. 

A fifth, related family of applications is pattern recognition, such as facial recog- 
nition or translating text from one language to another. This area has seen great 
progress using procedures such as “deep learning,” which are in essence highly non- 
linear models estimated (“trained”) using very many observations. 

A common feature of all of these problems is that handling large data sets cre- 
ates computational challenges. Those challenges include storing and accessing large 
data sets efficiently and developing fast algorithms for estimating models. These 
computational issues are important; however, we do not address them in this chapter 
and instead leave them to computer science curricula. 

The results of machine learning applied to large data sets are increasingly part 
of our everyday world. Examples range from software that helps doctors make diag- 
noses to techniques that target online advertisements to facial recognition algorithms 
that are used by law enforcement officials. In economics, applications include esti- 
mating local incomes based on satellite data, predicting sales for a firm using detailed 
customer data, interpreting network data on social media sites, searching for patterns 
in high-frequency asset price databases to use in computerized trading algorithms, 
and forecasting macroeconomic growth using up-to-the-minute data. Increasingly, 
computerized analysis of nonstandard data, especially text data, is playing a role in 
econometric applications. 

This chapter cannot cover all these uses of big data, so it focuses on one of the 
most important for economic applications: the many-predictor problem. Although 
the nomenclature of this growing field—machine learning, data science, and so 
forth—makes it seem difficult and new, the methods discussed in this chapter are, at 
their core, extensions of linear regression analysis that are tailored to the opportuni- 
ties and challenges of large data sets. 


The Many-Predictor Problem and OLS 


This chapter considers the problem of predicting test scores for a school using vari- 
ables describing the school, its students, and its community. The full data set consists 
of data gathered on 3932 elementary schools in the state of California in 2013. The 
task is to use these data to develop a prediction model that will provide good out-of- 
sample predictions—that is, predictions for schools not in the data set. To simulate 
the out-of-sample prediction problem, for most of the chapter we use half the obser-
vations (n = 1966) for estimating prediction models. The remaining half of the 
observations are reserved as a test data set to assess how the models perform and are 
not used until Section 14.6. 

The variable to be predicted is the average fifth-grade test score at the school. 
The primary data set contains 817 distinct variables relating to school and community 
characteristics; these variables are summarized in Table 14.1. For comparison, smaller 
and larger data sets are used in Section 14.6. The data are described in more detail in 
Appendix 14.1. 

If only the main variables in Table 14.1 were used, there would be 38 regressors. 
The analysis of the district test score data in Section 8.4, however, revealed several 
interesting nonlinearities and interactions in the test score regressions. For example, 
the regressions in Table 8.3 indicate that there is a nonlinear relationship between 
test scores and the student-teacher ratio and, in addition, that this relationship differs 
depending on whether there are a large number of English learners in the district. In 
Section 8.4, these nonlinearities were handled by including third-degree polynomials 
of the student-teacher ratio and interaction terms. As laid out in Table 14.1, including 
interactions, squares, and cubes increases the number of predictors to 817. In
Section 14.6, we consider an even larger data set with 2065 predictors, which exceeds 
the 1966 observations in the estimation sample! Regression with 817 regressors, not 
to mention 2065 regressors, goes well beyond anything attempted so far in this text. 
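
The predictor count described above can be reproduced with a short sketch. The helper names `count_predictors` and `expand_predictors` are hypothetical, and the expansion shown (levels, squares, cubes, and all pairwise interactions) is an assumed reading of how the 817 predictors are built from the 38 main variables.

```python
from itertools import combinations

import numpy as np


def count_predictors(m):
    """Levels + squares + cubes + pairwise interactions of m main variables."""
    return m + m + m + m * (m - 1) // 2


def expand_predictors(X):
    """Build the expanded regressor matrix from an (n, m) matrix X."""
    n, m = X.shape
    blocks = [X, X**2, X**3]
    # One column for each distinct pair of main variables.
    interactions = [X[:, i] * X[:, j] for i, j in combinations(range(m), 2)]
    if interactions:
        blocks.append(np.column_stack(interactions))
    return np.hstack(blocks)


if __name__ == "__main__":
    print(count_predictors(38))  # 38 + 38 + 38 + 703
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 38))
    print(expand_predictors(X).shape)
```

The number of interaction terms grows with the square of the number of main variables, which is why a few dozen primitive predictors quickly turn into hundreds of regressors.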

TABLE 14.1 Variables in the 817-Predictor School Test Score Data Set

Main variables (38)
  Fraction of students eligible for free or reduced-price lunch
  Fraction of students eligible for free lunch
  Fraction of English learners
  Teachers' average years of experience
  Instructional expenditures per student
  Median income of the local population
  Student-teacher ratio
  Number of enrolled students
  Fraction of English-language proficient students
  Ethnic diversity index
  Ethnicity variables (8): fraction of students who are American Indian, Asian, Black,
    Filipino, Hispanic, Hawaiian, two or more, none reported
  Number of teachers
  Fraction of first-year teachers
  Fraction of second-year teachers
  Part-time ratio (number of teachers divided by teacher full-time equivalents)
  Per-student expenditure by category, district level (7)
  Per-student expenditure by type, district level (5)
  Per-student revenues by revenue source, district level (4)
+ Squares of main variables (38)
+ Cubes of main variables (38)
+ All interactions of main variables (38 x 37/2 = 703)
Total number of predictors = k = 38 + 38 + 38 + 703 = 817

A natural starting point is OLS. Unfortunately, OLS can produce quite poor predictions
when the number of predictors is large relative to the sample size. Fortunately, there
are estimators other than OLS that can produce more reliable predictions when the
number of predictors relative to the sample size is large. This fact might seem
surprising in light of the Gauss-Markov theorem, which says that the OLS estimator has
the lowest variance of all linear unbiased estimators as long as the Gauss-Markov
conditions hold (Appendix 5.2). The reason for this surprising result, and the reason
it does not violate the Gauss-Markov theorem, is that these alternative estimators are
biased. Although the estimators are biased, their variance is sufficiently smaller than
the variance of the OLS estimator for them to produce better predictions.
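
The predictor count in Table 14.1 can be checked mechanically. The sketch below is only an illustration on simulated data: the function name `expand_predictors` and the simulated matrix are assumptions made here, not part of the text's data set or software.

```python
from itertools import combinations
from math import comb

import numpy as np


def expand_predictors(X):
    """Return the columns [X, X**2, X**3, all pairwise products of X]."""
    k = X.shape[1]
    interactions = [X[:, a] * X[:, b] for a, b in combinations(range(k), 2)]
    return np.column_stack([X, X**2, X**3] + interactions)


k_main = 38
n_interactions = comb(k_main, 2)          # 38 x 37 / 2
k_total = 3 * k_main + n_interactions     # main + squares + cubes + interactions

# Confirm the count on a simulated matrix with 38 "main variables".
rng = np.random.default_rng(0)
X_main = rng.normal(size=(50, k_main))
X_big = expand_predictors(X_main)
print(n_interactions, k_total, X_big.shape)
```

The same construction applied to real data would also need to drop any columns that turn out to be perfectly collinear (for example, the square of a binary variable), a detail this sketch ignores.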


The Mean Squared Prediction Error 


To compare prediction models, we need a quantitative measure of predictive accu- 
racy. As we have throughout this text, we will use the square of the error—in this case, 
the error from out-of-sample predictions. Using the squared prediction error means 
that small errors receive little weight but large errors receive great weight. This makes 
sense in many prediction problems, where small errors have negligible impact but very 
large errors can undercut the usefulness and credibility of the prediction. 

The mean squared prediction error (MSPE) is the expected value of the square 
of the prediction error that arises when the model is used to make a prediction for 
an observation not in the data set. 

Stated mathematically, the MSPE is 


$$\text{MSPE} = E\left[Y^{oos} - \hat{Y}(X^{oos})\right]^2,$$ (14.1)


where $X^{oos}$ and $Y^{oos}$ are out-of-sample (“oos”) observations on X and Y and
$\hat{Y}(x)$ is the predicted value of Y for a value x of the predictors. As usual, X is
shorthand for the k separate predictors. The notation of Equation (14.1) is taken from
Appendix 6.4 (the least squares assumptions for prediction). The notation distinguishes
between the n observations $(X_i, Y_i)$, $i = 1, \ldots, n$, used to estimate the
prediction model that produces $\hat{Y}(x)$ and the out-of-sample observation for which
the prediction is made. The out-of-sample observation is not used to estimate the
prediction model.

From the perspective of minimizing the MSPE, the best possible prediction is the 
conditional mean, that is, $E(Y^{oos} | X^{oos})$ (Appendix 2.2 and Exercise 14.8). This
best-possible prediction, $E(Y^{oos} | X^{oos})$, is sometimes called the oracle prediction.
Because the conditional mean is unknown, the oracle prediction cannot be used in 
practice (it is infeasible); however, it is the benchmark against which to judge all 
feasible predictions. In the regression model, the oracle prediction corresponds to the 
prediction that would be made using the true (unknown) population regression 
coefficients. 

The MSPE embodies two sources of prediction errors. First, even if the condi- 
tional mean were known, the prediction would be imperfect: The oracle prediction 
makes the prediction error, $Y^{oos} - E(Y^{oos} | X^{oos})$. Second, $E(Y^{oos} | X^{oos})$ is
unknown, and estimating its parameters, that is, estimating the coefficients of the
prediction model $\hat{Y}(x)$, introduces an additional source of error.


The First Least Squares Assumption for Prediction


The school test score application uses data on some (but not all) California schools to 
estimate the prediction model. We can have confidence that this prediction model will 
generalize to other California schools; however, we have much less confidence that it will 
apply to schools in Europe and even less confidence that it will apply to schools in India. 

The first least squares assumption for prediction makes this intuition precise. 
This assumption, which was introduced in Appendix 6.4, states that the out-of-sample 
observation is drawn from the same distribution as the in-sample observations used 
to estimate the model: 


First least squares assumption for prediction: $(X^{oos}, Y^{oos})$ are randomly
drawn from the same population distribution as the estimation sample
$(X_i, Y_i)$, $i = 1, \ldots, n$.


Because the in- and out-of-sample observations are drawn from the same distri- 
bution, the conditional mean, E(Y|X), is the oracle prediction for both in- and out- 
of-sample observations. 

The first least squares assumption for prediction is a statement about external 
validity: The in-sample model can be generalized to the out-of-sample observation 
of interest. 

Although we refer to this assumption as the first least squares assumption for 
prediction, the requirement applies for estimation methods other than least squares. 
This condition is assumed to hold for the remainder of this chapter. 


The Predictive Regression Model 
with Standardized Regressors 


This chapter uses a modified version of the linear regression model in which the 
regressors are all standardized; that is, they are transformed to have mean 0 and vari- 
ance 1. In addition, the dependent variable is transformed to have mean 0. By using 
standardized regressors, all the regression coefficients have the same units, a property 
used in the methods of Sections 14.3-14.5. 

Let $(X^*_{1i}, \ldots, X^*_{ki}, Y^*_i)$, $i = 1, \ldots, n$, denote the data as originally
collected, where $X^*_{ji}$ is the $i$th observation on the $j$th original regressor. The
standardized regressors are $X_{ji} = (X^*_{ji} - \mu_{X^*_j})/\sigma_{X^*_j}$, where
$\mu_{X^*_j}$ and $\sigma_{X^*_j}$ are, respectively, the population mean and standard
deviation of $X^*_{j1}, \ldots, X^*_{jn}$. The transformed (demeaned) dependent variable is
$Y_i = Y^*_i - \mu_{Y^*}$, where $\mu_{Y^*}$ is the population mean of $Y^*_1, \ldots, Y^*_n$.

With this notation, the standardized predictive regression model is the regression 
of Y, which has mean 0, on the k standardized X’s: 


$$Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i.$$ (14.2)


The intercept is excluded from Equation (14.2) because all the variables have
mean 0.

Because the regressors are standardized, the regression coefficients have the
same units: $\beta_j$ is the difference in the predicted value of Y associated with a one
standard deviation difference in $X_j$, holding constant the other X’s.

Because the focus of this chapter is prediction, we adopt throughout the 
prediction interpretation of the regression model in Appendix 6.4; that is, 
$E(Y|X) = \sum_{j=1}^{k} \beta_j X_j$ and $E(u|X) = 0$.

As usual, the linear structure in Equation (14.2) means that the predictions are 
linear in the coefficients; however, the regression function can be nonlinear in the 
predictors because X can include nonlinear terms such as squares or interactions. 


The MSPE in the standardized predictive regression model. In the standardized
regression model in Equation (14.2), the prediction for the out-of-sample value of
the predictors is $\hat{Y}(X^{oos}) = \hat\beta_1 X_1^{oos} + \cdots + \hat\beta_k X_k^{oos}$.
The prediction error is

$$Y^{oos} - \hat\beta_1 X_1^{oos} - \cdots - \hat\beta_k X_k^{oos} = u^{oos} - \left[(\hat\beta_1 - \beta_1) X_1^{oos} + \cdots + (\hat\beta_k - \beta_k) X_k^{oos}\right],$$

where the final expression obtains using Equation (14.2), and $u^{oos}$ is the value of the
error u for the out-of-sample observation. Because $u^{oos}$ is independent of the data
used to estimate the coefficients and is uncorrelated with $X^{oos}$, the MSPE in
Equation (14.1) for the standardized predictive regression model can be written as
the sum of two components:


$$\text{MSPE} = \sigma_u^2 + E\left[(\hat\beta_1 - \beta_1) X_1^{oos} + \cdots + (\hat\beta_k - \beta_k) X_k^{oos}\right]^2.$$ (14.3)

The first term in Equation (14.3), $\sigma_u^2$, is the variance of the oracle prediction
error—that is, of the prediction error made using the true (unknown) conditional 
mean, E(Y|X). 

The second term in Equation (14.3) is the contribution to the prediction error 
arising from the estimated regression coefficients. This second term represents the 
cost, measured in terms of increased mean squared prediction error, of needing to 
estimate the coefficients instead of using the oracle prediction. 
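
The step from the prediction error to Equation (14.3) can be spelled out as follows. The symbol $\Delta$ is introduced here only as shorthand for the estimation-error term; it is not notation from the text.

```latex
\begin{aligned}
\Delta &\equiv (\hat\beta_1 - \beta_1) X_1^{oos} + \cdots + (\hat\beta_k - \beta_k) X_k^{oos}, \\
\text{MSPE} &= E\left[(u^{oos} - \Delta)^2\right] \\
            &= E\left[(u^{oos})^2\right] - 2\,E\left[u^{oos}\Delta\right] + E\left[\Delta^2\right] \\
            &= \sigma_u^2 + E\left[\Delta^2\right].
\end{aligned}
```

The cross term drops out because $u^{oos}$ has conditional mean 0 given $X^{oos}$ and is independent of the estimation sample that determines $\hat\beta$, so $E[u^{oos}\Delta] = 0$; and $E[(u^{oos})^2] = \sigma_u^2$ because $u^{oos}$ has mean 0.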

Because the mean square is the sum of the variance and the square of the bias 
(Equation (2.33)), the second term in Equation (14.3) is the sum of the variance of the 
prediction arising from estimating $\beta$ and the squared bias of the prediction. When it
comes to determining which estimator to use, the goal is to make this second term in 
Equation (14.3) as small as possible. As we shall see, when there are many predictors, 
this entails trading off the bias of the estimated coefficients against their variance. 


Standardization using the sample means and variances. In practice, the population 
means and standard deviations of the original variables are not known. Accordingly, 
the in-sample means and variances are used to standardize the regressors, and the 
in-sample mean is subtracted from the dependent variable. 

Because the regressors are standardized and the dependent variable is demeaned,
an additional step is needed to produce the prediction for an out-of-sample observation.
Specifically, the out-of-sample observation on the predictors must be standardized
using the in-sample mean and standard deviation, and the in-sample mean of
the dependent variable must be added back into the prediction. Formulas are given
in Appendix 14.5.
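
The standardize-then-add-back step can be sketched in a few lines. This is an illustration under stated assumptions, not a reproduction of the Appendix 14.5 formulas: the helper names (`fit_standardizer`, `standardize`, `predict_original_units`) are hypothetical, the data are simulated, and OLS is used only as a placeholder estimator.

```python
import numpy as np


def fit_standardizer(X_train, y_train):
    """Store the in-sample means and standard deviations."""
    return {
        "x_mean": X_train.mean(axis=0),
        "x_std": X_train.std(axis=0),   # divisor n; a divisor of n - 1 is also common
        "y_mean": y_train.mean(),
    }


def standardize(X, stats):
    """Standardize X using the in-sample moments stored in stats."""
    return (X - stats["x_mean"]) / stats["x_std"]


def predict_original_units(X_new, coef, stats):
    """Standardize X_new with in-sample moments, then add back the in-sample mean of Y."""
    return stats["y_mean"] + standardize(X_new, stats) @ coef


rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_train = 10.0 + X_train @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=200)
X_oos = rng.normal(loc=5.0, scale=2.0, size=(5, 3))

stats = fit_standardizer(X_train, y_train)
Z_train = standardize(X_train, stats)
y_demeaned = y_train - stats["y_mean"]
coef, *_ = np.linalg.lstsq(Z_train, y_demeaned, rcond=None)   # OLS without an intercept
y_hat_oos = predict_original_units(X_oos, coef, stats)
```

The important design point is that the out-of-sample rows are never used to compute the means or standard deviations; only the stored in-sample values are applied to them.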


The MSPE of OLS and the Principle of Shrinkage 


In the special case that the regression error u in Equation (14.2) is homoskedastic, 
the MSPE of OLS is given by 


$$\text{MSPE}_{OLS} \cong \left(1 + \frac{k}{n}\right)\sigma_u^2.$$ (14.4)


The approximation in Equation (14.4) holds exactly in some special cases (Exer- 
cise 14.12), and it holds more generally as an approximation when n is large and k/n 
is small. In the case of a single regressor, Equation (14.4) is derived in Appendix 14.2. 
The derivation of Equation (14.4) for general k uses matrix algebra and is given in 
Appendix 19.7. 

This expression has a simple interpretation. As discussed following Equation (14.3),
the MSPE of the oracle prediction, that is, the prediction using the true
value of $\beta$, is $\sigma_u^2$. When the k regression coefficients are estimated by OLS, the
MSPE increases by the factor (1 + k/n) relative to the best-possible MSPE. Thus 
the cost, as measured by the MSPE, of using OLS depends on the ratio of the number 
of regressors to the sample size. 

For example, in the school test score application, suppose the 38 main regressors 
in Table 14.1 are used to predict test scores. Although 38 regressors sounds like a lot, 
k/n = 38/1966 ≈ 0.02, so using OLS entails only about a 2% loss in MSPE relative to the
oracle prediction. In many applications, a loss of 2% might not be important. In the
data set with 817 regressors, however, k/n = 817/1966 ≈ 0.42, and a deterioration of
roughly 40% is large enough that it is worth investigating estimators that have a lower
MSPE than OLS.
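
A small simulation can make the factor (1 + k/n) concrete. The sketch below assumes homoskedastic normal errors and uncorrelated standard normal regressors, in which case the second term of Equation (14.3) reduces to the expected squared distance between the estimated and true coefficient vectors. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np


def simulated_ols_mspe(n, k, sigma_u=1.0, reps=500, seed=0):
    """Average of sigma_u^2 + ||beta_hat - beta||^2 over simulated samples.

    With out-of-sample regressors that are uncorrelated with mean 0 and
    variance 1, the second term of Equation (14.3), conditional on the
    estimates, equals the squared distance between beta_hat and beta.
    """
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=k)
    total = 0.0
    for _ in range(reps):
        X = rng.normal(size=(n, k))
        y = X @ beta + sigma_u * rng.normal(size=n)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        total += sigma_u**2 + np.sum((beta_hat - beta) ** 2)
    return total / reps


n, k = 400, 8
approx = (1 + k / n) * 1.0**2        # Equation (14.4) with sigma_u = 1
simulated = simulated_ols_mspe(n, k)
print(approx, simulated)
```

In this design the simulated value should sit close to, and slightly above, the approximation, consistent with Equation (14.4) being an approximation that works best when k/n is small.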

Because OLS is unbiased under the prediction interpretation of Equation (14.2), 
the inflation factor (1 + k/n) arises solely from the variance of the OLS estimator. 
Under the Gauss-Markov conditions, the OLS estimator has the smallest variance
of all linear unbiased estimators. As a result, one might naturally be discouraged 
about making much headway when k/n is large. But a major conceptual break- 
through in the many-predictor problem, dating to the early 1960s, was the discovery 
that if one allows for biased estimators, the estimator variance can be reduced by so 
much that the MSPE can be less than that of OLS. 


The principle of shrinkage. A shrinkage estimator introduces bias by “shrinking” 
the OLS estimator toward a specific number and thereby reducing the variance of 
the estimator. Because the mean squared error is the sum of the variance and the 
squared bias (Equation (2.33)), if the estimator variance is reduced by enough, then 
the decrease in the variance can more than compensate for the increase in the 
squared bias. The result is an estimator with a lower mean squared error than OLS. 




James and Stein (1961) developed the first estimator that achieved this goal of
reducing the estimator mean squared error by introducing bias. When the regressors
are uncorrelated, the James-Stein estimator can be written as $\tilde\beta^{JS} = c\hat\beta$,
where $\hat\beta$ is the OLS estimator and c is a factor that is less than 1 and depends on
the data. Because c is less than 1, the James-Stein estimator shrinks the OLS estimator
toward 0 and thus is biased toward 0. It is not surprising that the James-Stein estimator
has a lower mean squared error than the OLS estimator when the true $\beta$’s are small.
What James and Stein showed, however, is that if the errors are normally distributed,
their estimator has a lower mean squared error than the OLS estimator, regardless
of the true value of $\beta$, as long as k ≥ 3.
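
The text does not give the formula for the factor c. As an illustration only, the sketch below uses the classical James-Stein factor for the simplest setting (k unbiased, independent normal estimates with known variance), which is an assumption made here and not necessarily the form used elsewhere in this book. The true values are deliberately set away from 0.

```python
import numpy as np


def james_stein(z, sigma2=1.0):
    """Classical James-Stein shrinkage of z toward 0 (intended for k >= 3)."""
    k = z.shape[-1]
    c = 1.0 - (k - 2) * sigma2 / np.sum(z**2, axis=-1, keepdims=True)
    return c * z


rng = np.random.default_rng(2)
k, reps = 10, 20000
theta = np.full(k, 2.0)                     # true values, not close to 0
z = theta + rng.normal(size=(reps, k))      # unbiased estimates with variance 1

mse_unbiased = np.mean(np.sum((z - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((james_stein(z) - theta) ** 2, axis=1))
print(mse_unbiased, mse_js)
```

Even though the true values are not small, the shrunken estimates have the lower total mean squared error in this simulation, which is the sense in which the result is surprising.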

James and Stein’s remarkable result is the foundation of many-predictor meth- 
ods used with big data. Their result leads to the family of shrinkage estimators, which 
includes ridge regression and the Lasso estimator, the topics of Sections 14.3 and 14.4, 
respectively. 


Estimation of the MSPE 


The MSPE is a population expectation and thus is unknown. However, it can be 
estimated from a sample of data. Here, we discuss two ways to estimate the MSPE. 
The first, split-sample estimation, draws directly on the definition of the MSPE and 
entails dividing the sample into two subsamples, one for estimation and one for pre- 
diction. The second, called m-fold cross validation, extends this idea but uses the data 
symmetrically and more efficiently by dividing the sample into m subsamples. 


Estimating the MSPE using a split sample. Recall that the MSPE is the expected squared
prediction error for a randomly drawn X, where the observation is not used to
estimate $\beta$. This definition suggests estimating the MSPE by dividing the data set into
two parts: an estimation subsample and a “test” subsample used to simulate out-of-sample
prediction. The estimation subsample is used to estimate $\beta$, yielding the estimate
$\tilde\beta$, which could be obtained by OLS or some other estimator. This estimate is
then used to make a prediction $\hat{Y}$ for each of the $n_{test}$ observations in the test
subsample. The MSPE is then estimated using the resulting $n_{test}$ prediction errors:

$$\widehat{\text{MSPE}}_{\text{split-sample}} = \frac{1}{n_{test}} \sum_{\text{observations in test subsample}} (Y_i - \hat{Y}_i)^2.$$ (14.5)

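
Equation (14.5) translates almost directly into code. The following is a minimal sketch on simulated data; the function names, the 50/50 split, and the use of OLS as the estimator are assumptions made for illustration.

```python
import numpy as np


def split_sample_mspe(X, y, fit, predict, test_fraction=0.5, seed=0):
    """Estimate the MSPE as in Equation (14.5) from one random split."""
    rng = np.random.default_rng(seed)
    n = len(y)
    order = rng.permutation(n)
    n_test = int(round(test_fraction * n))
    test, est = order[:n_test], order[n_test:]
    model = fit(X[est], y[est])                   # uses the estimation subsample only
    errors = y[test] - predict(model, X[test])    # out-of-sample prediction errors
    return np.mean(errors**2)


def ols_fit(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def linear_predict(coef, X):
    return X @ coef


rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, -1.0]) + rng.normal(size=1000)
mspe_hat = split_sample_mspe(X, y, ols_fit, linear_predict)
print(mspe_hat)
```

Passing `fit` and `predict` as arguments keeps the estimator of the MSPE separate from the estimator of the coefficients, so the same function can be reused with a shrinkage estimator.
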
Estimating the MSPE by m-fold cross validation. The split-sample procedure treats
the data asymmetrically by arbitrarily splitting the observations into two subsamples
that are then used for different purposes. This estimator can be improved by treating
the data symmetrically. Specifically, the two subsamples can be used to produce
two different estimators of the MSPE by swapping which subsample is used to estimate
$\beta$ and which is used to estimate the MSPE.

This idea extends to m different, randomly chosen subsamples. The resulting
procedure is called m-fold cross validation. In m-fold cross validation, there are m
separate estimates of the MSPE, each produced by sequentially leaving out one of
the m subsamples when estimating $\beta$ and using that reserved subsample to estimate
the MSPE. The m-fold cross-validation estimator of the MSPE is the average of the
m subset estimators of the MSPE. The m-fold cross-validation estimator of the MSPE
is summarized in Key Concept 14.1.


KEY CONCEPT 14.1 m-fold Cross Validation

The m-fold cross-validation estimator of the MSPE is determined according to
the following six steps.

1. Divide the sample into m randomly chosen subsets of approximately equal size.

2. Use the combined subsamples 2, 3, ..., m to compute $\tilde\beta$, an estimate of $\beta$.

3. Use $\tilde\beta$ and Equation (14.12) to compute predicted values $\hat{Y}$ and prediction
   errors $Y - \hat{Y}$ for the observations in subsample 1.

4. Using subsample 1 as the test sample, estimate the MSPE with the predicted
   values in subsample 1 and Equation (14.5); call this estimate $\widehat{\text{MSPE}}_1$.

5. Repeat steps 2-4 using subsample 2 as the left-out test sample, then subsample 3,
   and so forth, yielding a total of m estimates $\widehat{\text{MSPE}}_i$, $i = 1, \ldots, m$.

6. The m-fold cross-validation estimator of the MSPE is then estimated by averaging
   these m subsample estimates of the MSPE:

   $$\widehat{\text{MSPE}}_{m\text{-fold cross validation}} = \frac{1}{m} \sum_{i=1}^{m} \left(\frac{n_i}{n/m}\right) \widehat{\text{MSPE}}_i,$$ (14.6)

   where $n_i$ is the number of observations in subsample i and the factor in parentheses
   allows for different numbers of observations in the different subsamples.
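
The six steps of Key Concept 14.1 can be sketched as a short function. This is an illustration on simulated data with OLS as a stand-in estimator; the function names and the data-generating process are assumptions, and the prediction step is simply a linear prediction from the estimated coefficients.

```python
import numpy as np


def m_fold_cv_mspe(X, y, fit, predict, m=10, seed=0):
    """m-fold cross-validation estimate of the MSPE, following Key Concept 14.1."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), m)      # step 1: m random subsets
    estimate = 0.0
    for test in folds:                                 # steps 2-5: leave one subset out
        est = np.setdiff1d(np.arange(n), test)
        model = fit(X[est], y[est])
        errors = y[test] - predict(model, X[test])
        mspe_i = np.mean(errors**2)                    # Equation (14.5) on this subset
        weight = len(test) / (n / m)                   # n_i / (n/m)
        estimate += weight * mspe_i / m                # step 6: Equation (14.6)
    return estimate


def ols_fit(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def linear_predict(coef, X):
    return X @ coef


rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, -1.0]) + rng.normal(size=1000)
cv_hat = m_fold_cv_mspe(X, y, ols_fit, linear_predict, m=10)
print(cv_hat)
```

Because of the weights in Equation (14.6), the result equals the sum of all squared left-out prediction errors divided by n, whether or not the subsets have exactly equal sizes.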

A loose end in m-fold cross validation is how to choose m. This involves a trade-off.
A larger value of m produces more efficient estimators of $\beta$ because more observations
are used each time $\beta$ is estimated. From this perspective, ideally one would
use the so-called leave-one-out cross-validation estimator, for which m = n, so that
each estimate of $\beta$ uses n − 1 observations. But a larger value of m means that $\beta$
must be estimated m times. In applications in which k is large (in the hundreds or
more), this can take considerable computer time, and leave-one-out cross validation
takes too long computationally. As a result, the choice of m must be made taking into
account practical constraints on your and your computer’s time. In the school test
score application in this chapter, we settle on m = 10 as a practical compromise given
the computer we used, so that each subsample estimator of $\beta$ uses 90% of the sample.

The m-fold cross-validation estimator can be used to estimate the MSPE in very
general settings, regardless of how $\beta$ is estimated. It even works for models that can
be expressed only as algorithms, not in terms of parameters. This general applicability
makes it widely used in empirical work with big data.


14.3 Ridge Regression


Sections 14.3 and 14.4 describe two shrinkage estimators that are designed for use 
with many predictors. The method discussed in this section, ridge regression, shrinks 
the estimated parameter toward 0 by adding to the sum of squared residuals a penalty
that increases with the square of the estimated parameter. By minimizing the sum 
of these two terms, which is called the penalized sum of squared residuals, 
ridge regression introduces bias into the estimator but reduces its variance. In 
some applications, ridge regression can result in large improvements in MSPE 
compared to OLS. 


Shrinkage via Penalization and Ridge Regression 


One way to shrink the estimated coefficients toward 0 is to penalize large values of 
the estimate. The ridge regression estimator is based on this idea. Specifically, the 
ridge regression estimator minimizes the penalized sum of squares, which is the sum 
of squared residuals plus a penalty factor that increases with the sum of the squared 
coefficients: 


$$S^{Ridge}(b; \lambda_{Ridge}) = \sum_{i=1}^{n} (Y_i - b_1 X_{1i} - \cdots - b_k X_{ki})^2 + \lambda_{Ridge} \sum_{j=1}^{k} b_j^2,$$ (14.7)


where $\lambda_{Ridge} \ge 0$. The parameter $\lambda_{Ridge}$ is called the ridge shrinkage parameter. The
ridge regression estimator is the value of b that minimizes $S^{Ridge}(b; \lambda_{Ridge})$.

The first term on the right-hand side of Equation (14.7) is the usual sum of 
squared residuals for a trial coefficient value b. If this were the only term, then the 
ridge and OLS estimators would be the same. The second term, however, is new. This 
second term increases with the sum of the squared coefficients. This second term in 
Equation (14.7) is called a penalty term because it penalizes the estimator for choos- 
ing a large estimate of the coefficient. When the penalty term is scaled by the shrink- 
age parameter and added to the sum of squared residuals, as it is in Equation (14.7), 
the result is called the penalized sum of squared residuals. 

The penalty term shrinks the ridge regression estimator toward 0. Figure 14.1 
shows how ridge penalization works when there is only one regressor. Without the 
penalty, one would minimize the sum of squared residuals, which yields the OLS 
estimator. Adding in the penalty shifts the minimum of the penalized function toward 
0. Thus the estimated ridge coefficient will be closer to 0 than the OLS estimator is; 
that is, the ridge regression estimator is shrunk toward 0. 

The magnitude of the shrinkage depends on the shrinkage parameter $\lambda_{Ridge}$.
If $\lambda_{Ridge} = 0$, there is no shrinkage, and the ridge regression estimator equals the
OLS estimator. The larger $\lambda_{Ridge}$, the greater the penalty for a given value of b, and
the greater the shrinkage of the estimator toward 0. Because we are using the
standardized predictive regression model, all the coefficients have the same units, so a
single shrinkage parameter $\lambda_{Ridge}$ can be used for all the coefficients.

FIGURE 14.1 Components of the Ridge Regression Penalty Function

The ridge regression estimator minimizes $S^{Ridge}(b)$, which is the sum of squared
residuals, SSR(b), plus a penalty that increases with the square of the estimated
parameter. The SSR is minimized at the OLS estimator, $\hat\beta$. Including the penalty
shrinks the ridge estimator, $\hat\beta^{Ridge}$, toward 0.

[Figure: curves labeled $S^{Ridge}(b)$, Penalty(b), and SSR(b).]


The penalized sum of squared residuals in Equation (14.7) can be minimized
using calculus to give a simple expression for the ridge regression estimator. This
formula is derived in Appendix 14.3 for the case of a single regressor. When k ≥ 2,
the formula is best expressed using matrix notation, and it is given in Appendix 19.7.

In the special case that the regressors are uncorrelated, the ridge regression
estimator is

$$\hat\beta_j^{Ridge} = \left(\frac{1}{1 + \lambda_{Ridge} / \sum_{i=1}^{n} X_{ji}^2}\right) \hat\beta_j,$$ (14.8)

where $\hat\beta_j$ is the OLS estimator of $\beta_j$. In this case, the ridge regression estimator
shrinks the OLS estimator toward 0, like the James-Stein estimator. When the regressors
are correlated, the ridge regression estimate can sometimes be greater than the OLS
estimate, although overall the ridge regression estimates are shrunk toward 0.

When there is perfect multicollinearity, such as when k > n, the OLS estimator 
can no longer be computed, but the ridge estimator can. 
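
The two claims above (the shrinkage factor in Equation (14.8) and computability when k > n) can be checked numerically. The sketch uses the standard matrix form of the ridge minimizer, solving (X'X + λI)b = X'Y; this form is stated here as an assumption about what the matrix expression looks like rather than copied from the appendix, and the data and the function name `ridge` are illustrative.

```python
import numpy as np


def ridge(X, y, lam):
    """Minimize sum((y - X b)^2) + lam * sum(b^2) by solving (X'X + lam I) b = X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)


rng = np.random.default_rng(4)
n, k, lam = 100, 3, 25.0

# Regressors with exactly orthogonal columns, which is what "uncorrelated"
# amounts to in the sample for mean-zero regressors.
Q, _ = np.linalg.qr(rng.normal(size=(n, k)))
X = Q * np.array([5.0, 10.0, 20.0])        # column sums of squares: 25, 100, 400
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

beta_ols = ridge(X, y, 0.0)                             # lam = 0 reproduces OLS
shrink = 1.0 / (1.0 + lam / np.sum(X**2, axis=0))       # factor in Equation (14.8)
beta_ridge = ridge(X, y, lam)

# With more predictors than observations, X'X is singular but X'X + lam I is not.
X_wide = rng.normal(size=(20, 50))
y_wide = rng.normal(size=20)
beta_wide = ridge(X_wide, y_wide, lam)
print(shrink, beta_wide.shape)
```

Note that the regressor with the smallest sum of squares is shrunk the most, which is one reason the chapter works with standardized regressors.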


Estimation of the Ridge Shrinkage Parameter by Cross Validation

The ridge regression estimator depends on the shrinkage parameter $\lambda_{Ridge}$. While the
value of $\lambda_{Ridge}$ could be chosen arbitrarily, a better strategy is to pick $\lambda_{Ridge}$ so
that the ridge regression estimator works well for the data at hand.

One might initially think that the shrinkage parameter $\lambda_{Ridge}$ could be estimated
by minimizing $S^{Ridge}(b; \lambda_{Ridge})$ in Equation (14.7). However, for any trial value of b,
minimizing $S^{Ridge}(b; \lambda_{Ridge})$ with respect to $\lambda_{Ridge}$ simply leads to setting
$\lambda_{Ridge}$ to 0; but when $\lambda_{Ridge} = 0$, the ridge regression estimator is just the OLS
estimator! The reason that this approach yields the OLS estimator is that it provides
the best in-sample fit, which is given by OLS. In contrast, the goal of prediction is to
have a good out-of-sample fit, that is, a low MSPE.

That insight suggests choosing $\lambda_{Ridge}$ to minimize the estimated MSPE. This strategy
can be implemented using the m-fold cross-validation estimator of the MSPE
(Key Concept 14.1). Specifically, suppose you have two candidate values of $\lambda_{Ridge}$,
for example, 0.1 and 0.2, and choose some value of m. Let $\tilde\beta$ in Key Concept 14.1
denote the ridge regression estimator using $\lambda_{Ridge} = 0.1$. Given $\tilde\beta$, compute the
predictions in the test sample, and use those predictions to compute the MSPE for that
estimator. Now repeat, but use $\lambda_{Ridge} = 0.2$. You now have two estimates of the
MSPE, one for $\lambda_{Ridge} = 0.1$ and one for $\lambda_{Ridge} = 0.2$, so choose the value of
$\lambda_{Ridge}$ that provides the lowest estimated MSPE. Repeating these steps for multiple
values of $\lambda_{Ridge}$ yields an estimator of $\lambda_{Ridge}$ that minimizes the m-fold
cross-validation MSPE. Although this estimator could potentially be 0, so that the best
ridge estimator is the OLS estimator, typically the best shrinkage parameter will not be 0
and the ridge estimator will differ from the OLS estimator.
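
The search over candidate values can be sketched as a loop over a grid. The example below is a simulation under assumed settings (many small true coefficients, k close to n); the grid values and function names are hypothetical, and the simulated regressors are taken to be already on a common scale.

```python
import numpy as np


def ridge(X, y, lam):
    """Ridge coefficients from solving (X'X + lam I) b = X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)


def cv_mspe_ridge(X, y, lam, m=10, seed=0):
    """m-fold cross-validation MSPE of ridge for one value of lam."""
    n = len(y)
    # The same seed gives the same folds for every lam, so candidates are
    # compared on identical splits.
    folds = np.array_split(np.random.default_rng(seed).permutation(n), m)
    sse = 0.0
    for test in folds:
        est = np.setdiff1d(np.arange(n), test)
        b = ridge(X[est], y[est], lam)
        sse += np.sum((y[test] - X[test] @ b) ** 2)
    return sse / n      # equals the weighted average in Equation (14.6)


rng = np.random.default_rng(5)
n, k = 120, 80
X = rng.normal(size=(n, k))
beta = rng.normal(scale=0.1, size=k)
y = X @ beta + rng.normal(size=n)

grid = [0.0, 1.0, 10.0, 50.0, 100.0, 500.0, 5000.0]   # hypothetical candidate values
cv = {lam: cv_mspe_ridge(X, y, lam) for lam in grid}
lam_hat = min(cv, key=cv.get)
print(lam_hat, cv[lam_hat], cv[0.0])
```

In practice a finer grid, often evenly spaced on a log scale, would replace the handful of values used here.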


Application to School Test Scores 


We illustrate the use of ridge regression by fitting a predictive model for school test 
scores using the 817 predictors in Table 14.1 with 1966 observations. 

Figure 14.2 plots the square root of the 10-fold cross-validation estimator of the
MSPE as a function of the ridge shrinkage parameter $\lambda_{Ridge}$. The square root of the
MSPE is plotted so that it provides an estimate of the magnitude of a typical
out-of-sample prediction error. For a given value of $\lambda_{Ridge}$, the MSPE was computed as
described in Key Concept 14.1. The choice of m = 10 represents a practical balance
between the desire to use as many observations as possible to estimate the parameters
and the computational burden of repeating that estimation m times for each
value of $\lambda_{Ridge}$.

As Figure 14.2 shows, the MSPE has a U shape. It is minimized at $\lambda_{Ridge} = 2233$,
so the 10-fold cross-validation estimate of the ridge shrinkage parameter is
$\hat\lambda_{Ridge} = 2233$.

The square root of the MSPE, evaluated at $\hat\lambda_{Ridge}$, is 39.5. In contrast, the root
MSPE for OLS, estimated using the same 817 predictors and 1966 observations, is
much larger, 78.2. Because the OLS estimator is the ridge estimator with $\lambda_{Ridge} = 0$,
in principle the root MSPE of the OLS estimator could also be shown in Figure 14.2
as the point ($\lambda_{Ridge} = 0$, root MSPE = 78.2); however, the root MSPE for OLS is
so large that it is off the scale of the figure.


FIGURE 14.2 Square Root of the MSPE for the Ridge Regression Prediction as a Function
of the Shrinkage Parameter

The MSPE is estimated using 10-fold cross validation for the school test score data set
with k = 817 predictors and n = 1966 observations.

[Figure: vertical axis, square root of MSPE.]

The fact that the OLS MSPE is much larger than the ridge MSPE provides an
empirical demonstration of the main theoretical point discussed in Section 14.2: 
When there are many predictors, introducing bias into the parameter estimates 
via shrinkage can reduce the variance of the prediction by more than enough to 
compensate for the bias and therefore produce much more accurate predictions. 

Because $\hat\lambda_{Ridge}$ is chosen to minimize the cross-validated MSPE, the
cross-validated MSPE evaluated at $\hat\lambda_{Ridge}$ is no longer an unbiased estimator of the
MSPE. In Section 14.6, we use the remaining 1966 observations (not used so far) to obtain
an unbiased estimator of the MSPE for ridge regression using $\hat\lambda_{Ridge}$.

It is also of interest to compare the ridge regression coefficients to the OLS
coefficients. That comparison is made in Section 14.6, where these coefficients are also
compared to those produced by the methods discussed in Sections 14.4 and 14.5, the
Lasso and principal components, respectively.


The Lasso 


In OLS and ridge regression, none of the estimated coefficients is exactly 0 so all the 
regressors are used to make the prediction. In some applications, however, only a few 
predictors might be useful, with the rest irrelevant. For example, among the predic- 
tors in Table 14.1, all but 38 are constructed as squares, cubes, or interactions of the 
38 main variables; if the true conditional expectation is, in fact, linear in the 38 main 
variables, then 817 − 38 = 779 of the variables would have a coefficient of 0.

A regression model in which the coefficients are nonzero for only a small fraction
of the predictors is called a sparse model. If the model is sparse, predictions can
be improved by estimating many of the coefficients to be exactly 0.

The estimator examined in this section, the Lasso (least absolute shrinkage and 
selection operator), is designed for sparse models. Like ridge regression, the Lasso 
shrinks estimated coefficients toward 0. Unlike ridge regression, it sets many of the
estimated coefficients exactly to 0, thereby dropping those regressors from the model.
Moreover, the regressors it keeps are subject to less shrinkage than with ridge regres- 
sion. Thus, the Lasso provides a way to select a subset of the regressors and then 
estimate their coefficients with a modest amount of shrinkage. 

Like ridge regression, the Lasso can be used when k > n. Also like ridge regres- 
sion, the Lasso has a shrinkage parameter that can be estimated by minimizing the 
cross-validated MSPE. 


Shrinkage Using the Lasso 


The Lasso estimator minimizes a penalized sum of squares, where the penalty 
increases with the sum of the absolute values of the coefficients: 


$$S^{Lasso}(b; \lambda_{Lasso}) = \sum_{i=1}^{n} (Y_i - b_1 X_{1i} - \cdots - b_k X_{ki})^2 + \lambda_{Lasso} \sum_{j=1}^{k} |b_j|,$$ (14.9)

where $\lambda_{Lasso}$ is called the Lasso shrinkage parameter. The Lasso estimator is the value
of b that minimizes $S^{Lasso}(b; \lambda_{Lasso})$. As with ridge regression, if the shrinkage
parameter $\lambda_{Lasso} = 0$, the Lasso estimator minimizes the sum of squared residuals, in
which case the Lasso is just OLS. The second term in Equation (14.9) penalizes large values
of b and thus shrinks the Lasso estimate toward 0.²

The first part of the Lasso name—least absolute shrinkage — reflects the nature 
of the penalty term in Equation (14.9). Whereas the ridge regression penalty increases 
with the square of b, the Lasso penalty increases with its absolute value. 

The second part of the Lasso name—selection operator—arises because the 
Lasso estimates many coefficients to be exactly 0, thereby dropping some of the 
predictors. Thus the Lasso, in effect, selects a subset of the predictors to be used in 
the model. 

The reason that the Lasso estimates some coefficients to be exactly 0 is illustrated
in Figure 14.3 for k = 1. This figure shows the sum of squared residuals, the
Lasso penalty, and the combined Lasso minimization function in Equation (14.9).
Parts a and b of Figure 14.3 differ only in the value of the OLS estimate, which
minimizes the first term in Equation (14.9). In Figure 14.3a, the OLS estimate is far from
0 ($\hat\beta = 1.0$), and the Lasso shrinks it to a smaller value ($\hat\beta^{Lasso} = 0.5$). In
Figure 14.3b, the curve representing the sum of squared residuals is shifted to the left,
so the OLS estimate is smaller ($\hat\beta = 0.4$), and the Lasso estimate is exactly 0
($\hat\beta^{Lasso} = 0$). This estimate of exactly 0 arises because the sum of squared residuals
function in Figure 14.3b is so flat near 0 that the penalty term takes over from the sum
of squared residuals and drives the estimate to 0.

FIGURE 14.3 The Lasso Estimator Minimizes the Sum of Squared Residuals Plus a Penalty
That Is Linear in the Absolute Value of b

For a single regressor, (a) when the OLS estimator is far from 0, the Lasso estimator
shrinks it toward 0; (b) when the OLS estimator is close to 0, the Lasso estimator
becomes exactly 0.

² The ridge and Lasso penalty terms can both be written as $\lambda \sum_{j=1}^{k} |b_j|^p$, where
p = 2 for ridge and p = 1 for Lasso. The expression $(\sum_{j=1}^{k} |b_j|^p)^{1/p}$ is called the
$L_p$ length of b, where p = 2 corresponds to the usual Euclidean distance. As a result,
ridge is sometimes called $L_2$ penalization, and the Lasso is sometimes called $L_1$
penalization.

Appendix 14.4 provides a formula for the Lasso estimator when k = 1. The for- 
mula shows mathematically that for sufficiently small values of the OLS estimator, 
the Lasso estimator is exactly 0. 

The ridge and Lasso estimators also behave differently when the OLS estimate is 
large. For large values of b, the ridge penalty exceeds the Lasso penalty. Thus, when the 
OLS estimate is large, the Lasso shrinks it less than ridge, but when the OLS estimate 
is small, the Lasso shrinks it more than ridge —in some cases, all the way to 0. 
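
For a single regressor, setting the derivative of Equation (14.9) to 0 on each side of b = 0 gives a closed form: the OLS estimate is moved toward 0 by the amount λ/(2ΣX²) and set to 0 if it is smaller than that amount in absolute value. The sketch below is derived here from Equation (14.9) rather than taken from Appendix 14.4, and the function names and simulated data are illustrative assumptions. It checks the closed form against a brute-force grid search and includes the single-regressor ridge formula for comparison.

```python
import numpy as np


def lasso_single(beta_ols, sum_x2, lam):
    """Minimizer of sum((Y - b X)^2) + lam * |b| for one regressor."""
    threshold = lam / (2.0 * sum_x2)
    return np.sign(beta_ols) * max(abs(beta_ols) - threshold, 0.0)


def ridge_single(beta_ols, sum_x2, lam):
    """Minimizer of sum((Y - b X)^2) + lam * b^2 for one regressor."""
    return beta_ols / (1.0 + lam / sum_x2)


def lasso_objective(b, x, y, lam):
    return np.sum((y - b * x) ** 2) + lam * abs(b)


# Brute-force check of the closed form on simulated data.
rng = np.random.default_rng(6)
x = rng.normal(size=50)
y = 0.3 * x + rng.normal(size=50)
lam = 20.0
beta_ols = np.sum(x * y) / np.sum(x**2)
closed_form = lasso_single(beta_ols, np.sum(x**2), lam)
grid = np.linspace(-2.0, 2.0, 8001)
brute_force = grid[np.argmin([lasso_objective(b, x, y, lam) for b in grid])]
print(closed_form, brute_force)
```

With ΣX² = 1 and λ = 1 (values chosen here only so that the threshold is 0.5), `lasso_single(1.0, 1.0, 1.0)` gives 0.5 and `lasso_single(0.4, 1.0, 1.0)` gives 0, while `ridge_single(4.0, 1.0, 1.0)` gives 2.0 against 3.5 for the Lasso, so the Lasso shrinks a large estimate less than ridge does.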

Figure 14.3 considers the case of a single regressor, for which the Lasso always 
shrinks the OLS estimator toward 0. If there are multiple predictors, then the Lasso 
generally shrinks the OLS estimates toward 0; however, it is possible that the Lasso 
estimate of some of the coefficients could be larger than the OLS estimate. 


Computation of the Lasso estimator. Unlike OLS and ridge regression, there is no 
simple expression for the Lasso estimator when k > 1, so the Lasso minimization 
problem must be done numerically using a computer. One of the many computa- 
tional advances in machine learning is the development of specialized algorithms to 
compute the Lasso estimator. Some econometric software packages incorporate 
these algorithms and make it straightforward to use the Lasso estimator. 
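
To give a sense of what such a numerical algorithm can look like, here is a minimal sketch of cyclic coordinate descent, one widely used approach to the Lasso problem. It is not necessarily the algorithm used by any particular software package. It assumes the objective is $\sum_i (Y_i - X_i'b)^2 + \lambda \sum_j |b_j|$, that no column of X is identically zero, and that a fixed number of sweeps is enough; the function names and the sweep count are hypothetical choices.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z toward 0 by t, returning 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_coordinate_descent(X, y, lam, n_sweeps=500):
    """Minimize sum((y - X b)**2) + lam * sum(|b_j|) one coefficient at a time."""
    n, k = X.shape
    b = np.zeros(k)
    col_ss = (X ** 2).sum(axis=0)          # sum of squares of each column
    resid = y - X @ b                      # residuals at the current b
    for _ in range(n_sweeps):
        for j in range(k):
            # Inner product of column j with the residual that leaves out regressor j
            rho = X[:, j] @ resid + col_ss[j] * b[j]
            new_bj = soft_threshold(rho, lam / 2.0) / col_ss[j]
            resid += X[:, j] * (b[j] - new_bj)   # keep residuals in sync with b
            b[j] = new_bj
    return b
```

Each inner step solves the one-coefficient Lasso problem exactly, holding the other coefficients fixed, which is why the single-regressor soft-thresholding rule reappears inside the loop. With $\lambda = 0$ the procedure converges to the OLS estimate when the regressors are not perfectly collinear.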


Estimation of the shrinkage parameter by cross validation. As in ridge regression, 
the Lasso tuning parameter can be estimated by minimizing an estimate of the 
MSPE. The algorithm for estimating $\lambda_{Lasso}$ is the same as that laid out in Section 14.3
for estimating $\lambda_{Ridge}$.


A word of warning about the ridge and Lasso estimators. The ridge and Lasso esti- 
mators differ from all the other estimators used in this text in an important way. In 
OLS, the fit of the regression model is the same whether one uses the k original 
regressors or k linear combinations of the regressors as long as one avoids perfect 
multicollinearity. For example, one can use an intercept and a dummy variable for 
male, or an intercept and a dummy variable for female, or both a male dummy and a 
female dummy and no intercept; all yield identical fits of the OLS regression and 
identical predictions. Moreover, which of these three specifications is used makes no 
difference for the other estimated coefficients in the model. 

In contrast, with ridge and Lasso the regression fit, the estimated coefficients, and 
the predictions in general depend on the specific choice of the linear combination of 
regressors used. This is easiest to see for the Lasso because the population values of 
the coefficients change as you change linear combinations. For example, the 
coefficient on male in the (intercept, male) specification differs from that in the
(female, male) specification. Thus the Lasso might drop male from the (intercept, 
male) specification but not from the (female, male) specification. If so, the (intercept, 
male) and (female, male) specifications would have different selected predictors and 
thus would make different predictions. 

The reason that the choice of linear combinations matters for ridge is more 
subtle and stems from the fact that different linear combinations will have different 
correlations with each other. An explanation of this result for ridge regression is 
given in Appendix 19.7. 

The dependence of the ridge and Lasso estimators on the choice of linear combina- 
tion of regressors implies that one needs to put thought into choosing the regressors when 
using these estimators, a decision that does not matter for OLS or for the principal
components method of Section 14.5 (or, for that matter, for logit, probit, or IV regression).
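
A small numerical check makes the contrast concrete. The sketch below uses six made-up observations and compares the (intercept, male) and (female, male) specifications. As a simplifying assumption, the ridge fit here penalizes every coefficient, including the intercept, and does not standardize the regressors, so it illustrates the dependence on the choice of regressors rather than reproducing the exact estimator of Section 14.3. All names and data values are hypothetical.

```python
import numpy as np

# Made-up data: three males followed by three females.
male = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
female = 1.0 - male
y = np.array([10.0, 12.0, 14.0, 20.0, 22.0, 24.0])

X_a = np.column_stack([np.ones(6), male])   # (intercept, male)
X_b = np.column_stack([female, male])       # (female, male), no intercept


def ols_fit(X, y):
    """Fitted values from OLS."""
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]


def ridge_fit(X, y, lam):
    """Fitted values from ridge with every coefficient penalized (an assumption)."""
    k = X.shape[1]
    b = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
    return X @ b


print(ols_fit(X_a, y), ols_fit(X_b, y))                 # identical group means
print(ridge_fit(X_a, y, 1.0), ridge_fit(X_b, y, 1.0))   # different fitted values
```

Both OLS specifications return the group means (12 for males, 22 for females), whereas the two ridge specifications produce different fitted values even though they use the same information and the same penalty weight.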


Application to School Test Scores 


We now turn to estimation of a Lasso prediction model using the same 817 regressors 
and 1966 observations as in Section 14.3. 

Figure 14.4 plots the square root of the 10-fold cross-validation estimate of the
MSPE as a function of the Lasso shrinkage parameter $\lambda_{Lasso}$. The MSPE is minimized
when the shrinkage parameter is 4527, so $\hat{\lambda}_{Lasso} = 4527$. At this estimated value of
$\lambda_{Lasso}$, the square root of the MSPE is 39.7. This is much less than the square root of the
MSPE of OLS, 78.2, which is equivalent to the Lasso estimator for $\lambda_{Lasso} = 0$. The Lasso
value is close to, but slightly greater than, the minimized ridge value of 39.5 (from Section 14.3).


FIGURE 14.4 | Square Root of the MSPE for the Lasso Prediction as a Function of the
Lasso Shrinkage Parameter (Log Scale for $\lambda_{Lasso}$)

[Plot omitted: square root of MSPE (vertical axis) against $\lambda_{Lasso}$ (horizontal axis, log scale),
with the minimum marked at $\hat{\lambda}_{Lasso} = 4527$.]

The MSPE is estimated by 10-fold cross validation using the school test score data set
with k = 817 predictors and n = 1966 observations.

The Lasso estimates nonzero coefficients on only 56 of the 817 predictors; thus
the Lasso estimator excludes 761, or 93%, of the candidate predictors in Table 14.1.
Of the retained predictors, all but 4 are interactions among the 38 main predictors in
Table 14.1.


14.5 Principal Components


When the regressors are perfectly collinear, at least one of them can be dropped from 
the data set without any loss of information because the dropped regressor can be 
perfectly reconstructed from the retained regressors. This observation suggests that 
there might be little loss of information from dropping a variable that is highly, but 
imperfectly, correlated with the other regressors. This insight forms the basis for an 
alternative strategy for handling many predictors: Exploit the correlations among the 
regressors to reduce the number of regressors while retaining as much of the infor- 
mation in the original regressors as possible. Principal components analysis imple- 
ments this strategy and can reduce sharply the number of regressors so that estimation 
and prediction can proceed using OLS. 

This section begins by showing how principal components analysis works when 
there are two regressors. We then turn to the more relevant case when the number of 
regressors is large. 


Principal Components with Two Variables 


The principal components of a set of standardized variables X are linear combina- 
tions of those variables, where the linear combinations are chosen so that the princi- 
pal components are mutually uncorrelated and sequentially contain as much of the 
information in the original variables as possible. Specifically, the linear combination 
weights for the first principal component are chosen to maximize its variance, in this 
sense capturing as much of the variation of the X’s as possible. The linear combina- 
tion weights for the second principal component are chosen so that it is uncorrelated 
with the first principal component and captures as much of the variance of the X’s as 
possible, controlling for the first principal component. The third principal component 
is uncorrelated with the first two and captures as much of the variance of the X’s as 
possible, controlling for the first two principal components, and so forth. If $k \le n$ and
there is no perfect multicollinearity, then the total number of principal components
is k. If k > n, then the total number of principal components is n.

It is easiest to see how this procedure works when there are two X’s. Figure 14.5
illustrates this case when $X_1$ and $X_2$ are standard normal random variables with a
correlation of 0.7. The first principal component is the weighted average,
$PC_1 = w_1 X_1 + w_2 X_2$, with the maximum variance, where $w_1$ and $w_2$ are the principal
component weights. Choosing the weights corresponds to choosing a direction in
which to add the variables or, equivalently, choosing a direction in which the spread
of the variables is greatest. As Figure 14.5 illustrates, the spread of the variables is
greatest in the direction of the 45° line. Along this direction, the variables are added
together with equal weights.

FIGURE 14.5 | Scatterplot of 200 Observations on Two Standard Normal Random Variables,
$X_1$ and $X_2$, with Population Correlation 0.7

[Scatterplot omitted: $X_2$ (vertical axis) against $X_1$ (horizontal axis), with the directions
$PC_1 = (X_1 + X_2)/\sqrt{2}$ and $PC_2 = (X_1 - X_2)/\sqrt{2}$ marked.]

The first principal component ($PC_1$) maximizes the variance of the linear combination of
these variables, which is done by adding $X_1$ and $X_2$. The second principal component ($PC_2$)
is uncorrelated with the first and is obtained by subtracting the two variables. The
principal component weights are normalized so that the sum of squared weights adds to 1.

Without further restrictions, the variance of the linear combination can always
be increased simply by increasing both $w_1$ and $w_2$. Thus, for the principal components
problem to have a solution, it is necessary to restrict the weights. This is done by
requiring the sum of squared weights to equal 1; that is, $w_1^2 + w_2^2 = 1$. Along the 45°
line, the weights are equal, so $w_1 = w_2 = 1/\sqrt{2}$ and $PC_1 = (X_1 + X_2)/\sqrt{2}$, a
result derived mathematically in Exercise 14.11.

The second principal component is chosen to be uncorrelated with the first prin- 
cipal component, and the sum of its squared weights also equals 1. When there are 
two variables, these two requirements imply that $PC_2 = (X_1 - X_2)/\sqrt{2}$. This
corresponds to adding the variables along the downward-sloping 45° line in Figure 14.5.
As illustrated in the figure, the spread of the variables is minimized in this direction. 
Thus, when there are only two variables, the first principal component maximizes the 
variance of the linear combination, while the second principal component minimizes 
the variance of the linear combination. 

The variances of the two principal components are $\mathrm{var}(PC_1) = 1 + |\rho|$ and
$\mathrm{var}(PC_2) = 1 - |\rho|$, where $\rho = \mathrm{corr}(X_1, X_2)$ (Exercise 14.11). These expressions
confirm that if the variables are correlated, $PC_1$ has a greater variance than $PC_2$.

These expressions for the variances of $PC_1$ and $PC_2$ have another, more subtle
feature: $\mathrm{var}(PC_1) + \mathrm{var}(PC_2) = \mathrm{var}(X_1) + \mathrm{var}(X_2)$. This provides an $R^2$
interpretation of principal components: The fraction of the total variance explained by the first
principal component is $\mathrm{var}(PC_1)/[\mathrm{var}(X_1) + \mathrm{var}(X_2)]$, and the fraction explained
by the second is $\mathrm{var}(PC_2)/[\mathrm{var}(X_1) + \mathrm{var}(X_2)]$. Together, the two principal
components explain all the variance of X. For the two variables in Figure 14.5, the correlation
is 0.7, so the first principal component explains $(1 + \rho)/2 = 85\%$ of the variance of X,
while the second principal component explains the remaining 15% of the variance of X.
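
The two-variable results can be reproduced numerically. The sketch below relies on a standard fact that this section does not spell out: for standardized variables, the principal component weights are the eigenvectors of the correlation matrix and the principal component variances are the corresponding eigenvalues. The variable names are illustrative.

```python
import numpy as np

rho = 0.7                                        # population correlation from Figure 14.5
corr = np.array([[1.0, rho], [rho, 1.0]])        # correlation matrix of (X1, X2)

eigenvalues, eigenvectors = np.linalg.eigh(corr)  # eigh returns ascending eigenvalues
order = np.argsort(eigenvalues)[::-1]             # reorder so the largest comes first
variances = eigenvalues[order]                    # var(PC1), var(PC2)
weights = eigenvectors[:, order]                  # column j holds the weights of PC_j
share = variances / variances.sum()               # fraction of total variance explained

print(variances)   # var(PC1) = 1 + rho, var(PC2) = 1 - rho
print(share)       # 85% and 15%
print(weights)     # entries equal to +/- 1/sqrt(2); overall sign of a column is arbitrary
```

The sign of each weight vector is not pinned down (multiplying a principal component by -1 leaves its variance unchanged), so software may report $-(X_1 + X_2)/\sqrt{2}$ in place of $(X_1 + X_2)/\sqrt{2}$.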

If there are only two variables, there is little reason to reduce their number using 
principal components. The utility of principal components arises when there are 
many correlated variables, in which case much or most of the variation in those vari- 
ables can be captured by a smaller number of principal components. 


Principal Components with k Variables 


The principal components of the k variables $X_1, \ldots, X_k$ are the linear combinations
of those variables that are mutually uncorrelated, have squared weights that sum to 1, 
and maximize the variance of the linear combination controlling for the previous 
principal components. Assuming there is no perfect multicollinearity among the vari- 
ables, the number of principal components of X is the minimum of n and k. 

Expressions for the principal component weights for k > 2 are more compli- 
cated than when k = 2. Fortunately, there is a fast method for computing the princi- 
pal components and their weights. Because this method entails matrix calculations, 
it is deferred to Appendix 19.7. This procedure for computing principal components
is widely available in standard statistical software. 

Principal components with k variables is summarized in Key Concept 14.2. 


The scree plot. The equality in Equation (14.10) leads to a useful graph, known as a
scree plot, for visualizing the amount of variation in X that is captured by the $j$th
principal component.

A scree plot is the plot of the sample variance of the $j$th principal component relative
to the total sample variance in the X’s (that is, the sample value of
$\mathrm{var}(PC_j)/\sum_{j=1}^{k} \mathrm{var}(X_j)$) against the number of the principal component, j. Because
this ratio has the interpretation of the $R^2$ of the $j$th principal component, the scree plot
makes it possible to read off the fraction of the sample variance of the X’s explained by any
particular principal component. Because the principal components are mutually uncorrelated,
the cumulative sum of these ratios through the $p$th principal component is the fraction of the
total sample variance of X explained by the first p principal components.
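
The quantities plotted in a scree plot can be computed directly from the data. The sketch below is one way to do it, assuming the principal components are obtained from a singular value decomposition of the standardized data matrix (a standard route, though not necessarily the one described in Appendix 19.7). The function name and the simulated data are hypothetical stand-ins for a real data set.

```python
import numpy as np


def scree_shares(X):
    """Fraction of the total sample variance explained by each principal component."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardize each column
    s = np.linalg.svd(Z, compute_uv=False)              # singular values, descending
    pc_var = s ** 2 / (Z.shape[0] - 1)                  # sample variance of each PC
    return pc_var / Z.var(axis=0, ddof=1).sum()         # divide by sum of var(X_j)


# Simulated data: six variables that share one common component.
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 1))
X = common + 0.5 * rng.normal(size=(500, 6))

shares = scree_shares(X)          # heights of the scree plot
cumulative = np.cumsum(shares)    # variance explained by the first p components
print(shares)
print(cumulative)
```

Because the principal component variances add up to the total variance of the standardized X's, the shares sum to 1, and the cumulative sum gives the fraction explained by the first p components.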

Note: For k = 2, the equality $\mathrm{var}(PC_1) + \mathrm{var}(PC_2) = \mathrm{var}(X_1) + \mathrm{var}(X_2)$ can be verified
by adding the two expressions for the variances of the principal components:
$\mathrm{var}(PC_1) + \mathrm{var}(PC_2) = (1 + |\rho|) + (1 - |\rho|) = 2 = \mathrm{var}(X_1) + \mathrm{var}(X_2)$, where the final
equality follows because $X_1$ and $X_2$ are standardized and thus have unit variance.


KEY CONCEPT 14.2 | The Principal Components of X

The principal components of the k variables $X_1, \ldots, X_k$ are the linear combinations
of X that have the following properties:

(i) The squared weights of the linear combinations sum to 1;

(ii) The first principal component maximizes the variance of its linear
combination;

(iii) The second principal component maximizes the variance of its linear
combination, subject to its being uncorrelated with the first principal
component; and

(iv) More generally, the $j$th principal component maximizes the variance of
its linear combination, subject to its being uncorrelated with the first
$j - 1$ principal components.

• Assuming there is no perfect multicollinearity in X, the number of principal
components is the minimum of n and k.

• The sum of the sample variances of the principal components equals the
sum of the sample variances of the X’s:

$$\sum_{j=1}^{\min(n,k)} \mathrm{var}(PC_j) = \sum_{j=1}^{k} \mathrm{var}(X_j). \qquad (14.10)$$

• The ratio $\mathrm{var}(PC_j)/\sum_{j=1}^{k} \mathrm{var}(X_j)$ is the fraction of the total variance of
the X’s explained by the $j$th principal component. This measure is like an
$R^2$ for the total variance of the X’s.


Figure 14.6 is the scree plot for the first 50 principal components of the
817-variable data set in Table 14.1. The first principal component explains 18% of the
total sample variance of the 817 X’s, and the second principal component explains
11% of the total variance. Thus 29%, or more than one-fourth, of the total variance
of the 817 variables is explained by just these two principal components. The first 10
principal components explain 63% of the total variance of the 817 X’s, and the first
40 principal components explain 92% of the total variance.

The flattening in Figure 14.6 after the first few principal components is typical of 
many data sets in which the variables are highly correlated, as they are in the 
817-variable school test score data set. This feature gives the scree plot its name: It 
looks like a cliff, with boulders, or scree, cascading into a valley. 


Prediction using principal components. The fact that so much of the variation in the
817 predictors is captured by the first 10, or 50, principal components suggests that
one could replace the 817 predictors with far fewer principal components and use
those principal components as regressors. With many fewer regressors, the
coefficients can be estimated using OLS.

FIGURE 14.6 | Scree Plot for the 817-Variable School Data Set (First 50 Principal Components)

Plotted values are the fraction of the total variance of the 817 regressors explained by the
indicated principal component. The first principal component explains 18% of the total
variance of the 817 X's, and the first 10 principal components together explain 63% of the
total variance.

A key question is how many principal components p to include in the regression. 
Like the ridge and Lasso shrinkage parameters, the number of principal components 
p can be estimated by minimizing the MSPE, where the MSPE is estimated by m-fold 
cross validation. 

As discussed following Equation (14.3), computing the predicted value for an 
out-of-sample observation requires standardizing the observation using the in- 
sample mean and variance of each predictor. In the case of principal components 
regression, the out-of-sample values of the principal components must, in addition, 
be computed by applying the weights (the w’s) estimated using in-sample data values 
to the standardized X’s. The details are discussed in Appendix 14.5. 
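
The steps just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Appendix 14.5: the principal component weights are taken from a singular value decomposition of the standardized in-sample regressors, Y is demeaned using the in-sample mean, and the data are simulated. The function name `pcr_cv_mspe`, the fold assignment, and the candidate range for p are all hypothetical.

```python
import numpy as np


def pcr_cv_mspe(X, y, p, m=10, seed=0):
    """m-fold cross-validation estimate of the MSPE using the first p PCs."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), m)
    sq_errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Standardize using in-sample means and standard deviations only.
        mean, sd = X_tr.mean(axis=0), X_tr.std(axis=0, ddof=1)
        Z_tr = (X_tr - mean) / sd
        # Columns of W are the in-sample principal component weights.
        W = np.linalg.svd(Z_tr, full_matrices=False)[2].T[:, :p]
        gamma = np.linalg.lstsq(Z_tr @ W, y_tr - y_tr.mean(), rcond=None)[0]
        # Out of sample: same standardization and the same weights.
        Z_te = (X[test_idx] - mean) / sd
        pred = y_tr.mean() + (Z_te @ W) @ gamma
        sq_errors.append((y[test_idx] - pred) ** 2)
    return np.concatenate(sq_errors).mean()


# Simulated data: 20 correlated predictors driven by two underlying factors.
rng = np.random.default_rng(1)
factors = rng.normal(size=(200, 2))
X = factors @ rng.normal(size=(2, 20)) + 0.3 * rng.normal(size=(200, 20))
y = factors @ np.array([3.0, -2.0]) + rng.normal(size=200)

mspe_by_p = {p: pcr_cv_mspe(X, y, p) for p in range(1, 11)}
p_hat = min(mspe_by_p, key=mspe_by_p.get)   # cross-validation estimate of p
print(p_hat, np.sqrt(mspe_by_p[p_hat]))
```

The key design point is that nothing from the held-out fold is used to compute the means, standard deviations, or weights, so each fold's prediction errors mimic genuine out-of-sample errors.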


Application to School Test Scores 


Figure 14.7 plots the square root of the 10-fold cross-validation estimate of the MSPE
of the principal components predictor of school test scores as a function of the
number p of principal components used as regressors; the principal components were
computed using the same 817 predictors and 1966 observations as in Sections 14.3
and 14.4. Initially, increasing the number of principal components used as predictors
results in a sharp decline in the MSPE. After p = 5 principal components, the
improvement slows down, and after p = 23 principal components, the MSPE is
essentially flat in the number of predictors. The MSPE is minimized at 46 predictors,
so this is the cross-validation estimate of p; that is, $\hat{p} = 46$. Using 46 principal
components, the square root of the MSPE is 39.7, the same as for Lasso and just slightly
more than for ridge.

FIGURE 14.7 | Square Root of the MSPE for the Principal Components Prediction as a
Function of the Number of Principal Components p Used as Predictors

[Plot omitted: square root of MSPE (vertical axis) against the principal component number p
(horizontal axis, 0 to 60), with the minimum marked at $\hat{p} = 46$.]

The MSPE is estimated using 10-fold cross validation for the school test score data
set with k = 817 predictors and n = 1966 observations.


14.6 Predicting School Test Scores with Many Predictors


Do the many-predictor methods improve upon test score predictions made using 
OLS with a small data set and, if so, how do the many-predictor methods compare? 
To find out, we predict school test scores using small (k = 4), large (k = 817), and 
very large (k = 2065) data sets. For the small data set, the predictions are made using 
OLS. For the other data sets, they are made using OLS, ridge regression, the Lasso, 
and principal components. 

As was stressed in Section 14.2, the predictive performance that matters is perfor- 
mance out of sample. Because the m-fold MSPE is used to estimate the ridge and Lasso 
shrinkage parameters and the number of included principal components p, the MSPE 
no longer provides a true out-of-sample comparison among the prediction methods. We 
therefore have reserved half the observations for assessing the performance of the esti- 
mated models; we call these remaining observations the reserved test sample. 
Specifically, we use the following procedure, explained here for ridge regression,
to assess predictive performance. Using the 1966 observations in the estimation
sample, we estimate the shrinkage parameter $\lambda_{Ridge}$ by 10-fold cross validation; for the
817-predictor data set, this yields the estimate $\hat{\lambda}_{Ridge}$ reported in Section 14.3. Using
this estimated shrinkage parameter, the ridge regression coefficients are reestimated
using all 1966 observations in the estimation sample. Those estimated coefficients are
then used to predict the out-of-sample values $Y^{oos}$ for all the observations in the
reserved test sample. Analogous procedures are used for the Lasso and principal
components.
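
The procedure above can be sketched in code for ridge regression. The sketch assumes the ridge estimator for standardized regressors and demeaned Y has the usual closed form $(X'X + \lambda I)^{-1} X'Y$, uses simulated data in place of the school data set, and searches a small hypothetical grid of $\lambda$ values; the function names are illustrative.

```python
import numpy as np


def fit_predict(X_tr, y_tr, X_new, lam):
    """Standardize with in-sample moments, fit ridge, and predict for X_new."""
    mean, sd = X_tr.mean(axis=0), X_tr.std(axis=0, ddof=1)
    Z = (X_tr - mean) / sd
    b = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ (y_tr - y_tr.mean()))
    return y_tr.mean() + ((X_new - mean) / sd) @ b


def cv_mspe(X, y, lam, m=10, seed=0):
    """m-fold cross-validation estimate of the MSPE for a given lam."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), m)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx], lam)
        errs.append((y[test_idx] - pred) ** 2)
    return np.concatenate(errs).mean()


# Simulated stand-in for the data: half for estimation, half reserved for testing.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 50))
y = X[:, :5] @ np.ones(5) + rng.normal(size=400)
est, test = np.arange(200), np.arange(200, 400)

grid = [0.1, 1.0, 10.0, 100.0, 1000.0]                    # hypothetical candidate values
lam_hat = min(grid, key=lambda lam: cv_mspe(X[est], y[est], lam))
pred_oos = fit_predict(X[est], y[est], X[test], lam_hat)  # refit on the full estimation sample
root_mspe_oos = np.sqrt(np.mean((y[test] - pred_oos) ** 2))
print(lam_hat, root_mspe_oos)
```

The reserved test observations play no role in choosing `lam_hat` or in estimating the coefficients, which is what makes `root_mspe_oos` a genuine out-of-sample measure.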

Table 14.2 lists the three sets of predictors. The 4 predictors in the small set 
are similar to some regressors in Chapters 5-9 for the district-level test score 
regressions. The 817 predictors are those in Table 14.1. The very large set aug- 
ments the 38 main variables in Table 14.1 with demographic data on residents in 
the neighborhood of the school (age distribution, sex, marital status, education, 
and immigrant status), as well as some binary descriptors of the school and dis- 
trict, for a total of 65 main variables. For the very large data set, these 65 main 
variables are further augmented by all interactions, squares, and cubes, for a total 
of 2065 predictors—more than the number of observations (1966) in the estima- 
tion sample! 
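
The count of 2065 predictors can be verified with a short calculation. The component counts below are the ones reported in Table 14.2; the variable names are illustrative.

```python
from math import comb

main_vars = 65                       # main variables in the very large data set
binary = 5                           # binary main variables
nonbinary = main_vars - binary       # 60 nonbinary main variables
nonbinary_demographic = 22           # count reported in Table 14.2

squares = nonbinary                                   # 60
cubes = nonbinary                                     # 60
interactions = comb(nonbinary, 2)                     # 60 * 59 / 2 = 1770
binary_by_demographic = binary * nonbinary_demographic  # 5 * 22 = 110

total = main_vars + squares + cubes + interactions + binary_by_demographic
print(total)   # 2065, which exceeds the 1966 observations in the estimation sample
```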


TABLE 14.2 | The Three Sets of Predictors, School Test Score Data Set

Small (k = 4)
  School-level data on:
    Student-teacher ratio
    Median income of the local population
    Teachers’ average years of experience
    Instructional expenditures per student

Large (k = 817)
  The full data set in Table 14.1

Very Large (k = 2065)
  The main variables are those in Table 14.1, augmented with the 27 variables below, for a
  total of 65 main variables, 5 of which are binary:
    Population
    Age distribution variables in local population (8)
    Fraction of local population that is male
    Local population marital status variables (3)
    Local population educational level variables (4)
    Fraction of local housing that is owner occupied
    Immigration status variables (4)
    Charter school (binary)
    School has full-year calendar (binary)
    School is in a unified school district (large city) (binary)
    School is in Los Angeles (binary)
    School is in San Diego (binary)
  + Squares and cubes of the 60 nonbinary variables (60 + 60)
  + All interactions of the nonbinary variables (60 × 59/2 = 1770)
  + All interactions between the binary variables and the nonbinary demographic variables (5 × 22 = 110)
  Total number of variables = 65 + 60 + 60 + 1770 + 110 = 2065

TABLE 14.3 | Out-of-Sample Performance of Predictive Models for School Test Scores

Predictor Set                  OLS     Ridge Regression   Lasso   Principal Components
Small (k = 4)
  Estimated λ or p              -             -             -              -
  In-sample root MSPE         53.6            -             -              -
  Out-of-sample root MSPE     52.9            -             -              -
Large (k = 817)
  Estimated λ or p              -           2233          4527            46
  In-sample root MSPE         78.2          39.5          39.7           39.7
  Out-of-sample root MSPE     64.4          38.9          39.1           39.5
Very large (k = 2065)
  Estimated λ or p              -           3362          4221            69
  In-sample root MSPE           -           39.2          39.2           39.6
  Out-of-sample root MSPE       -           39.0          39.1           39.6

Notes: The in-sample MSPE is the 10-fold cross-validation estimate computed using the 1966 observations in the estimation
sample. For the many-predictor methods, the shrinkage parameter or p was estimated by minimizing this in-sample MSPE. The
out-of-sample MSPE is a split-sample estimate, computed with the 1966 observations in the reserved test sample and using the
model estimated from the full estimation sample.



The results of this comparison are summarized in Table 14.3. Four features stand 
out. First, the MSPE of OLS is much less using the small data set than using the large 
data set (OLS cannot be computed in the very large data set because k > n). When 
there are many regressors, OLS is unable to use the additional information to 
improve out-of-sample prediction. 

Second, for the many-predictor methods, there are substantial gains from increas- 
ing the number of predictors from 4 to 817, with the square root of the MSPE falling 
by roughly one-fourth. There are no further gains, however, from going to the very 
large set of regressors. 

Third, the in-sample estimates of MSPE (the 10-fold cross-validation estimates)
are similar to the out-of-sample estimates. In fact, the out-of-sample MSPEs are
slightly less than the in-sample MSPEs. There are two reasons for this surprising
result. First, the 10-fold MSPE uses only 90% of the data for estimating the
coefficients at any one time (that is, 0.9 × 1966 ≈ 1769 observations), whereas the
coefficients used for the out-of-sample estimate of the MSPE are estimated using all
1966 observations in the estimation sample. As a result, those latter coefficient
estimates are more precise. Second, there is random sampling variation in both estimates.
The more general point is that the in-sample 10-fold MSPEs provide a good guide to
the out-of-sample MSPE.

Fourth, the MSPE in the reserved test sample is generally similar for all the
many-predictor methods. This is not always the case; it just happens to be so in this
application. For these data, ridge regression has a slight edge, and the lowest
out-of-sample MSPE is obtained using ridge in the large data set.

TABLE 14.4 | Coefficients on Selected Standardized Regressors, 4- and 817-Variable Data Sets

                                                  k = 4     k = 817
Predictor                                         OLS       OLS        Ridge Regression   Lasso   Principal Components
Student-teacher ratio                              4.51      118.03          0.31          0             0.25
Median income of the local population             34.46      -21.73          0.38          0             0.30
Teachers’ average years of experience              1.00      -79.59         -0.11          0            -0.17
Instructional expenditures per student             0.54    -1020.77          0.11          0             0.19
Student-teacher ratio × Instructional
  expenditures per student                          -        -89.79          0.72          2.31          0.84
Student-teacher ratio × Fraction
  of English learners                               -        -81.66         -0.87         -5.09         -0.55
Free or reduced-price lunch × Index
  of part-time teachers                             -         29.42         -0.92         -8.17         -0.95

Notes: The index of part-time teachers measures the fraction of teachers who work part-time. For OLS, ridge, and Lasso, the
coefficients in Table 14.4 are produced directly by the estimation algorithms. For principal components, the coefficients in
Table 14.4 are computed from the principal component regression coefficients (the γ’s in Equation (14.13)), combined with the
principal component weights. The formula for the β coefficients for principal components is presented using matrix algebra in
Appendix 19.7.

Table 14.4 lists the coefficients on 7 of the variables in the 817-predictor data set; 
4 of the 7 are those in the small data set. Although none of these coefficients has a 
causal interpretation, comparing them across the different methods and data sets 
gives insights into how the various methods work. Because the regressors are stan- 
dardized, all the coefficients have the same units, points on the test per standard 
deviation of the original predictor.

Note: The coefficients on the X’s in the principal components column are obtained by combining the two steps
of prediction using principal components. Specifically, the principal components are linear combinations of
the X’s, and the principal components regression model is a linear combination of the principal components.
Thus the prediction can be written as a linear combination of the X’s, where the weights involve both the
principal components weights and the regression weights. The relevant formulas are given in Appendix 19.7.

Table 14.4 has several striking features. For the small model, the magnitudes of
the coefficients accord with the findings of Chapter 9 using the district-level data; for
example, a one-standard-deviation greater value of median income predicts a
34-point higher score on the test (the standard deviation of the test scores across
schools is 64 points). In the large data set, however, many of the OLS coefficients are
extremely large, and the pattern is erratic. With many regressors, OLS can fit indi- 
vidual observations by estimating large coefficients on specific variables, and this 
seems to be what is happening. This overfitting is why the predictive performance of 
OLS deteriorates moving from the small to the large data set. In contrast, the esti- 
mated coefficients for the many-predictor methods are substantially smaller and do 
not exhibit wild values. For the seven predictors in the table, the ridge and principal 
components coefficients are numerically similar. The Lasso coefficients, however, 
differ substantially from the ridge and principal components coefficients. Most nota- 
bly, many of the Lasso coefficients (92% in all) are 0, including the coefficients on the 
four variables in the small data set. For the three coefficients in the table that are 
nonzero, they have the same sign as the ridge and principal components coefficients 
but are much larger, an empirical illustration of the tendency of Lasso to shrink more 
than ridge for small coefficients but to shrink less than ridge for large ones. 

Another way to compare predictive models is to look at their predictions.
Figure 14.8 shows scatterplots of the four sets of predictions for the 817-variable
model, where the predictions are for the 1966 observations in the reserved test set.
Specifically, Figure 14.8a shows a scatterplot of the actual test scores versus the OLS
predictions, and Figure 14.8b is the scatterplot of the actual test scores versus the
ridge predictions. Figure 14.8c and Figure 14.8d are scatterplots of the Lasso versus the
ridge predictions and the principal components versus the ridge predictions, respectively.

FIGURE 14.8 | Scatterplots for Out-of-Sample Predictions Using the 817-Predictor Data Set

In Figure 14.8a and Figure 14.8b, the tighter the spread of the scatter along the 45°
line, the better the prediction. Ridge has a tighter scatter than OLS, and it makes better
out-of-sample predictions. (These scatterplots understate the improvement of ridge over
OLS because some of the OLS predictions are outside the vertical scale of the plot.)

The clustering of the points along the 45° line in Figure 14.8c and Figure 14.8d
indicates that the ridge, Lasso, and principal components predictions are generally
quite similar. Still, one can see quite a few schools for which the predictions differ by
at least 15 points, a substantial amount. Thus, while the three models have quite
similar performance as measured by the MSPE (Table 14.3), for any given school the
predictions can differ meaningfully.

The most important conclusion from this application is that for the large data set
the many-predictor methods succeed where OLS fails. The reason for this success is
that the many-predictor methods allow the coefficient estimates to be biased in a
way that reduces their variance by enough to compensate for the increased bias.
Another important conclusion is that the m-fold MSPE is close to the MSPE
computed using the reserved test sample. One finding that does not generalize, however,
is that the three methods happen to perform equally well in these data.


14.7 Conclusion


The coefficients in the predictive regression model do not have a causal interpreta- 
tion. This does not matter, however, when the goal is prediction; the aim simply is to 
make out-of-sample predictions that are as accurate as possible, where accuracy is 
measured by the MSPE. 

This chapter presented three methods for making predictions with many predic- 
tors. These methods provide different ways to overcome the poor performance of 
OLS predictions when the number of regressors is large relative to the sample size. 
The methods covered in this chapter (ridge regression, the Lasso, and principal
components regression) all introduce bias into the estimator of the β’s. However, this
bias is introduced in a way that reduces the variance of the prediction by enough to 
yield a smaller MSPE. 

Although ridge regression, the Lasso, and principal components regression all 
reduce variance by introducing bias, they do so in quite different ways. The Lasso sets 
many of the coefficients exactly to 0, in effect discarding those predictors. This 
approach works well when the oracle prediction model is sparse or approximately so. 
Principal components regression is most appropriate when the predictors, or groups 
of predictors, are highly correlated, in which case most of the variation in the regres- 
sors can be captured by a relatively small number of linear combinations of the 

Text as Data


Text contains a lot of information! That is why
you read the newspaper or posts on social 
media. That information keeps you abreast of politi- 
cal developments and helps you decide what to do 
tonight. By reading these sources, you use textual 
information (that is, textual data) to make predictions
about outcomes that are relevant to you. 

A major accomplishment of statistics and machine 
learning is figuring out how to use computers to read 
text and to make predictions using textual data. At a 
conceptual level, it is a big leap to go from analyzing 
numbers to analyzing texts. The key step in doing so 
is turning text data into numerical data. 

One way to turn text data into numerical data is 
to develop a list of words or phrases and then count 
the number of times that these words or phrases 
occur in a given text excerpt (for example, a news- 
paper article or blog post). These counts of words or 
phrases are numerical data that summarize the text. 
The unit of observation is the text excerpt, and the 
number of observations is the number of excerpts 
analyzed. This method of distilling a set of texts into 
occurrence counts of words or phrases was devel- 
oped by Frederick Mosteller and David Wallace 
(1963) and is the basis of the field of stylometrics 
(see the box titled “Who Invented Instrumental 


Variables Regression?” ). 


The approach of distilling text into counts of 
words or phrases has its own jargon. The list of words 
in a text is called a bag of words. The list of words and 
phrases of interest is called the dictionary. The dic- 
tionary may include only the words or phrases that 
are relevant to the prediction problem at hand, or 
it may contain all the words in the bag of words, 
excluding (for example) articles, pronouns, and 
conjunctions. 

The word counts now can be used as predic- 
tors (X’s) to predict a variable Y of interest. Thus, 
this bag-of-words approach has turned a seemingly 
intractable problem of combining text and numeri- 
cal data into a regression problem. 

Because the dictionary typically contains many 
words, the number of predictors can be large relative 
to the number of texts (n). If so, OLS would tend 
to produce poor predictions, but the methods in this 
chapter can be applied directly. For example, princi- 
pal components analysis can be a useful tool in this 
setting because often words appear in groups (think 
of the words used in an article about a baseball game 
compared to an article about macroeconomic con- 
ditions). Putting all these pieces together results in 
predictive models that take text as the input and 


yield a prediction of Y. 
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variables — specifically, by their first few principal components. Because these prin- 
cipal components are relatively few in number, they can be the regressors used in a 
multiple regression model estimated by OLS. Ridge regression shrinks the OLS esti- 
mates toward 0 but does not rely on there being sparsity or on the regressors being 
highly correlated; thus it provides a useful approach when the regressors are not 
highly correlated and there is no a priori reason to assume sparsity. As it happens, in 
the school test score data, the three methods perform similarly, but this coincidence 
does not occur in general. 
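
The word-count step described in the box “Text as Data” can be sketched in a few lines. This is a minimal illustration rather than a method taken from the text: the dictionary, the two excerpts, and the function name are all made up, and the tokenizer (lowercase letters and apostrophes only) is one simple choice among many.

```python
import re
from collections import Counter


def word_counts(text, dictionary):
    """Count how often each dictionary word occurs in one text excerpt."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in dictionary]


# Hypothetical dictionary and excerpts, made up for illustration.
dictionary = ["inflation", "rates", "inning", "pitcher"]
excerpts = [
    "The pitcher left in the sixth inning; the next pitcher finished the inning.",
    "Inflation fell, so interest rates may fall. Lower rates follow lower inflation.",
]

# One row per excerpt (observation), one column per dictionary word (predictor).
X = [word_counts(text, dictionary) for text in excerpts]
print(X)
```

Each row of X is one observation, so X plays the role of the matrix of predictors in the many-predictor methods of this chapter.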

As discussed in Section 14.1, making predictions using many predictors that
take on numerical values is only one of the opportunities provided by the
methods of machine learning. For example, the box “Text as Data” describes how the
tools of this section can be used to analyze text data. Similarly, principal
components analysis and its extensions can be used to turn images into numerical data,
which then can be analyzed by the many-predictor methods described in this
chapter. While many of the procedures in machine learning are new and the
computational algorithms and tools are sophisticated, at their core are the key ideas
of regression analysis, estimation, and testing that are at the heart of Parts I-III
of this text.

The use of machine learning in economics is young, and many exciting
applications await. For some examples and further reading, see Jean et al. (2016) (predicting
poverty using satellite imagery), Davis and Heller (2017) (examining treatment
heterogeneity for a summer jobs program), and Kleinberg et al. (2018) (application of
machine learning to criminal sentencing).° 


Summary 


1. The goal of prediction is to make accurate predictions for out-of-sample 
observations —that is, for observations not used to estimate the prediction 
model. 


2. The coefficients in prediction models do not have a causal interpretation. 


3. One of the opportunities provided by big data sets is making predictions using 
many predictors. However, OLS works poorly for prediction when the number 
of regressors is large relative to the sample size. 


4. The shortcomings of OLS can be overcome by using prediction methods that 
have lower variance at the cost of introducing estimator bias. These many- 
predictor methods can produce predictions with substantially better predictive 
performance than OLS, as measured by the MSPE. 


5. Ridge regression and the Lasso are shrinkage estimators that minimize a penal- 
ized sum of squared residuals. The penalty introduces a cost to estimating large 
values of the regression coefficient. The weight on the penalty (the shrinkage 
parameter) can be estimated by minimizing the m-fold cross-validation estima- 
tor of the MSPE. 


6. The principal components of a set of correlated variables capture most of the 
variation in those variables in a reduced number of linear combinations. Those 
principal components can be used in a predictive regression, and the number 
of principal components included can be estimated by minimizing the m-fold 
cross-validation MSPE. 
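
Summary point 5 can be illustrated with a small sketch of m-fold cross-validation for the ridge shrinkage parameter in the single-regressor case. The code is illustrative only and makes several simplifying assumptions that are not in the text: there is no intercept, the folds are formed by interleaving observations rather than by random assignment, the data are not re-standardized within each fold, and the data and function names are hypothetical.

```python
def ridge_k1(x, y, lam):
    """Ridge estimate with one regressor and no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def cv_mspe(x, y, lam, m):
    """m-fold cross-validation estimate of the MSPE for a given lam."""
    n = len(x)
    squared_errors = []
    for fold in range(m):
        test_idx = [i for i in range(n) if i % m == fold]
        train_idx = [i for i in range(n) if i % m != fold]
        # Estimate on the training folds only, then predict the held-out fold.
        b = ridge_k1([x[i] for i in train_idx], [y[i] for i in train_idx], lam)
        squared_errors += [(y[i] - b * x[i]) ** 2 for i in test_idx]
    return sum(squared_errors) / n


def choose_lambda(x, y, candidates, m):
    """Return the candidate lam with the smallest cross-validated MSPE."""
    return min(candidates, key=lambda lam: cv_mspe(x, y, lam, m))


# Hypothetical data that lie exactly on the line y = 2x.
x = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
y = [2.0 * xi for xi in x]
best = choose_lambda(x, y, candidates=[0.0, 1.0, 10.0], m=3)
print(best, cv_mspe(x, y, 0.0, 3), cv_mspe(x, y, 10.0, 3))
```

Because these made-up data contain no noise, no shrinkage is chosen; with noisy data and many regressors, a positive value would typically be selected.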


°The field of machine learning is growing rapidly. A textbook introduction to this area, which is accessible
to students after completing Parts I-III of this text, is Gareth James et al., An Introduction to Statistical
Learning (2013).
Review the Concepts


14.1 Using data from a random sample of elementary schools, a researcher regresses
average test scores on the fraction of students who qualify for reduced-price
meals. The regression indicates a negative coefficient that is highly statistically
significant and yields a high R². Is this regression useful for determining the
causal effect of school meals on student test scores? Why or why not? Is this
regression useful for predicting test scores? Why or why not?


14.2 Cross-validation uses in-sample observations. How does it estimate the MSPE
for out-of-sample observations, even though the econometrician does not
have those observations?


14.3 Regression coefficients estimated using shrinkage estimators are biased. Why
might these biased estimators yield more accurate predictions than an
unbiased estimator?


14.4 Ridge regression and Lasso are two regression estimators based on
penalization. Explain how they are similar and how they differ.


14.5 Suppose a data set with 10 variables produces a scree plot that is flat. What
does this tell you about the correlation of the variables? What does this
suggest about the usefulness of using the first few principal components of these
variables in a predictive regression?
Exercises


14.1 A researcher is interested in predicting average test scores for elementary
schools in Arizona. She collects data on three variables from 200 randomly
chosen Arizona elementary schools: average test scores (TestScore) on a
standardized test, the fraction of students who qualify for reduced-price
meals (RPM), and the average years of teaching experience for the school’s
teachers (TExp). The table below shows the sample means and standard
deviations from her sample.


Variable     Sample Mean    Sample Standard Deviation
TestScore    750.1          65.9
RPM          0.60           0.28
TExp         13.2           3.8


After standardizing RPM and TExp and subtracting the sample mean from
TestScore, she estimates the following regression:


$\widehat{TestScore} = -48.7 \times RPM + 8.7 \times TExp$, SER = 44.0


a. You are interested in using the estimated regression to predict average test
scores for an out-of-sample school with RPM = 0.52 and TExp = 11.1.


i. Compute the transformed (standardized) values of RPM and TExp
for this school; that is, compute the $X^{oos}$ values from the $X^{*oos}$ values,
as discussed preceding Equation (14.2).


ii. Compute the predicted value of average test scores for this school.


b. The actual average test score for the school is 775.3. Compute the error
for your prediction.


c. The regression shown above was estimated using the standardized
regressors and the demeaned value of TestScore. Suppose the regression
had been estimated using the raw data for TestScore, RPM, and TExp.
Calculate the values of the regression intercept and slope coefficients for
this regression.


d. Use the regression coefficients that you computed in (c) to predict average
test scores for an out-of-sample school with RPM = 0.52 and TExp = 11.1.
Verify that the prediction is identical to the prediction you computed in (a.ii).


14.2 A school principal is trying to raise funds so that all her students will receive
reduced-price meals; currently, only 40% qualify for reduced-price meals.
Can she use the regression in Exercise 14.1 to estimate the effect of the new
policy on test scores? Explain why or why not.


14.3 Describe the relationship, if any, between the standard error of a regression and
the square root of the MSPE of the regression’s out-of-sample predictions.


14.4 A large online retailer sells thousands of products. The retailer has detailed data on
the products purchased by each of its customers. Explain how you would use these
data to predict the next product purchased by a randomly selected customer.


14.5 Y is a random variable with mean $\mu = 2$ and variance $\sigma^2 = 25$.


a. Suppose you know the value of $\mu$.


i. What is the best (lowest MSPE) prediction of the value of Y? That is,
what is the oracle prediction of Y?


ii. What is the MSPE of this prediction?


b. Suppose you don’t know the value of $\mu$ but you have access to a random
sample of size n = 10 from the same population. Let $\bar{Y}$ denote the sample
mean from this random sample. You predict the value of Y using $\bar{Y}$.


i. Show that the prediction error can be decomposed as
$Y - \bar{Y} = (Y - \mu) - (\bar{Y} - \mu)$, where $(Y - \mu)$ is the prediction error of the
oracle predictor and $(\mu - \bar{Y})$ is the error associated with using $\bar{Y}$ as an
estimate of $\mu$.


ii. Show that $(Y - \mu)$ has a mean of 0, that $(\bar{Y} - \mu)$ has a mean of 0,
and that $Y - \bar{Y}$ has a mean of 0.


iii. Show that $(Y - \mu)$ and $(\bar{Y} - \mu)$ are uncorrelated.


iv. Show that the MSPE of $\bar{Y}$ is
$MSPE = E(Y - \mu)^2 + E(\bar{Y} - \mu)^2 = var(Y) + var(\bar{Y})$.


v. Show that MSPE = 25(1 + 1/10) = 27.5.


14.6 In Exercise 14.5(b), suppose you predict Y using $\bar{Y}/2$ instead of $\bar{Y}$.


a. Compute the bias of the prediction.

b. Compute the mean of the prediction error.

c. Compute the variance of the prediction error.

d. Compute the MSPE of the prediction.

e. Does $\bar{Y}/2$ produce a prediction with a lower MSPE than the $\bar{Y}$ prediction?

f. Suppose $\mu = 10$ (instead of $\mu = 2$). Does $\bar{Y}/2$ produce a prediction with
a lower MSPE than the $\bar{Y}$ prediction?

g. In a realistic setting, the value of $\mu$ is unknown. What advice would you
give someone who is deciding between using $\bar{Y}$ and $\bar{Y}/2$?


14.7 In Exercise 14.5(b), suppose you predict Y using $\bar{Y} - 1$ instead of $\bar{Y}$.


a. Compute the bias of the prediction.

b. Compute the mean of the prediction error.

c. Compute the variance of the prediction error.

d. Compute the MSPE of the prediction.

e. Does $\bar{Y} - 1$ produce a prediction with a lower MSPE than the $\bar{Y}$ prediction?

f. Does $\bar{Y} - 1$ produce a prediction with a lower MSPE than the $\bar{Y}/2$
prediction from Exercise 14.6?


14.8 Let X and Y be two random variables. Denote the mean of Y given X = x by
$\mu(x)$ and the variance of Y given X = x by $\sigma^2(x)$.


a. Show that the best (minimum MSPE) prediction of Y given X = x is
$\mu(x)$ and the resulting MSPE is $\sigma^2(x)$. (Hint: Review Appendix 2.2.)


b. Suppose X is chosen at random. Use the result in (a) to show that the
best prediction of Y is $\mu(X)$ and the resulting MSPE is
$E[Y - \mu(X)]^2 = E[\sigma^2(X)]$.


14.9 You have a sample of size n = 1 with data $y_1 = 2$ and $x_1 = 1$. You are
interested in the value of $\beta$ in the regression $Y = X\beta + u$. (Note there is no
intercept.)


a. Plot the sum of squared residuals $(y_1 - bx_1)^2$ as a function of b.

b. Show that the least squares estimate of $\beta$ is $\hat{\beta}^{OLS} = 2$.

c. Using $\lambda_{Ridge} = 1$, plot the ridge penalty term $\lambda_{Ridge} b^2$ as a function of b.

d. Using $\lambda_{Ridge} = 1$, plot the ridge penalized sum of squared residuals
$(y_1 - bx_1)^2 + \lambda_{Ridge} b^2$.

e. Find the value of $\hat{\beta}^{Ridge}$.

f. Using $\lambda_{Ridge} = 0.5$, repeat (c) and (d). Find the value of $\hat{\beta}^{Ridge}$.

g. Using $\lambda_{Ridge} = 3$, repeat (c) and (d). Find the value of $\hat{\beta}^{Ridge}$.

h. Use the graphs that you produced in (a)-(d) for the various values of
$\lambda_{Ridge}$ to explain why a larger value of $\lambda_{Ridge}$ results in more shrinkage
of the OLS estimate.


14.10 You have a sample of size n = 1 with data $y_1 = 2$ and $x_1 = 1$. You are interested
in the value of $\beta$ in the regression $Y = X\beta + u$. (Note there is no intercept.)


a. Plot the sum of squared residuals $(y_1 - bx_1)^2$ as a function of b.

b. Show that the least squares estimate of $\beta$ is $\hat{\beta}^{OLS} = 2$.

c. Using $\lambda_{Lasso} = 1$, plot the Lasso penalty term $\lambda_{Lasso}|b|$ as a function of b.

d. Using $\lambda_{Lasso} = 1$, plot the Lasso penalized sum of squared residuals
$(y_1 - bx_1)^2 + \lambda_{Lasso}|b|$.

e. Find the value of $\hat{\beta}^{Lasso}$.

f. Using $\lambda_{Lasso} = 0.5$, repeat (c) and (d). Find the value of $\hat{\beta}^{Lasso}$.

g. Using $\lambda_{Lasso} = 5$, repeat (c) and (d). Find the value of $\hat{\beta}^{Lasso}$.

h. Use the graphs that you produced in (a)-(d) for the various values of
$\lambda_{Lasso}$ to explain why a larger value of $\lambda_{Lasso}$ results in more shrinkage
of the OLS estimate.


14.11 Let $X_1$ and $X_2$ be two positively correlated random variables, both with variance 1.


a. (Requires calculus) The first principal component, $PC_1$, is the linear
combination of $X_1$ and $X_2$ that maximizes $var(w_1X_1 + w_2X_2)$, where
$w_1^2 + w_2^2 = 1$. Show that $PC_1 = (X_1 + X_2)/\sqrt{2}$. (Hint: First derive an
expression for $var(w_1X_1 + w_2X_2)$ as a function of $w_1$ and $w_2$.)

b. The second principal component is $PC_2 = (X_1 - X_2)/\sqrt{2}$. Show that
$cov(PC_1, PC_2) = 0$.

c. Show that $var(PC_1) = 1 + \rho$ and $var(PC_2) = 1 - \rho$, where
$\rho = cor(X_1, X_2)$.


14.12 Consider the fixed-effects panel data model $Y_{jt} = \alpha_j + u_{jt}$ for $j = 1, \ldots, k$
and $t = 1, \ldots, T$. Assume that $u_{jt}$ is i.i.d. across entities j and over time t with
$E(u_{jt}) = 0$ and $var(u_{jt}) = \sigma_u^2$.


a. The OLS estimator of $\alpha_j$ is the value of $a_j$ that makes the sum of squared
residuals $\sum_{j=1}^{k}\sum_{t=1}^{T}(Y_{jt} - a_j)^2$ as small as possible. Show that the OLS
estimator is $\hat{\alpha}_j = \bar{Y}_j = \frac{1}{T}\sum_{t=1}^{T} Y_{jt}$.

b. Show that


i. $\hat{\alpha}_j$ is an unbiased estimator of $\alpha_j$.

ii. $var(\hat{\alpha}_j) = \sigma_u^2/T$.

iii. $cov(\hat{\alpha}_i, \hat{\alpha}_j) = 0$ for $i \neq j$.


c. You are interested in predicting an out-of-sample value for entity j (that
is, for $Y_{j,T+1}$) and use $\hat{\alpha}_j$ as the predictor. Show that $MSPE = \sigma_u^2 + \sigma_u^2/T$.


d. You are interested in predicting an out-of-sample value for a randomly
selected entity (that is, for $Y_{j,T+1}$, where j is selected at random). You
again use $\hat{\alpha}_j$ as the predictor. Show that $MSPE = \sigma_u^2 + \sigma_u^2/T$.

e. The total number of in-sample observations is n = kT. Show that in
both (c) and (d) $MSPE = \sigma_u^2(1 + k/n)$.


Empirical Exercises 


E14.1 On the text website, http://www.pearsonglobaleditions.com, you will find
a data set CASchools_EE14_InSample that contains a subset of n = 500
schools from the data set used in this chapter. Included are data on test
scores and 20 of the primitive predictor variables; see
CASchools_EE141_Description for a description of the variables. In this exercise, you will
construct prediction models like those described in the text and use these models
to predict test scores for 500 out-of-sample schools. (Please read
EE141_SoftwareNotes on the text website before solving the exercise.)


a. From the 20 primitive predictors, construct squares of all the
predictors, along with all of the interactions (that is, the cross products
$X_j X_k$ for all j and k). Collect the 20 primitive predictors, their squares,
and all interactions into a set of k predictors. Verify that you have
20 + 20 + (20 × 19)/2 = 230 predictors. One of the primitive
predictors is the binary variable charter_s. Drop the predictor
(charter_s)² from the list of 230 predictors, leaving 229 predictors for
the analysis. Why should (charter_s)² be dropped from the original list
of predictors?


b. Compute the sample mean and standard deviation of each of the predic- 
tors, and use these to compute the standardized regressors. Compute the 
sample mean of TestScore, and subtract the sample mean from TestScore 
to compute its demeaned value. 


c. Using OLS, regress the demeaned value of TestScore on the standardized 
regressors. 


i. Did you include an intercept in the regression? Why or why not? 
ii. Compute the standard error of the regression. 


d. Using ridge regression with $\lambda_{Ridge} = 300$, regress the demeaned value of
TestScore on the standardized regressors. Compare the OLS and ridge
estimates of the standardized regression coefficients.


e. Using Lasso with $\lambda_{Lasso} = 1000$, regress the demeaned value of TestScore
on the standardized regressors. How many of the estimated Lasso
coefficients are different from 0? Which predictors have a nonzero coefficient?


f. Compute the scree plot for the 229 predictors. How much of the variance in 
the standardized regressors is captured by the first principal component? By 
the first two principal components? By the first 15 principal components? 


g. Compute 15 principal components from the 229 predictors. Regress the 
demeaned value of TestScore on the 15 principal components. 


h. On the text website, you will find a data set CASchools_EE14_ 
OutOfSample that contains data from another n = 500 schools. 


i. Predict the average test score for each of these 500 schools using the 
OLS, ridge, Lasso, and principal components prediction models that 
you estimated in (c), (d), (e), and (g). Compute the root mean square 
prediction error for each of the methods. 


ii. Construct four scatter plots like those in Figure 14.8. What do you 
learn from the plots? 


i. Estimate $\lambda_{Ridge}$, $\lambda_{Lasso}$, and the number of principal components using
10-fold cross validation from the in-sample data set.


j. Use the estimated values of $\lambda_{Ridge}$, $\lambda_{Lasso}$, and the number of principal
components from (i) to construct predictions of test scores for the
out-of-sample schools. Are these predictions more accurate than the
predictions you computed in (h)? Is the difference in line with what you
expected from the cross-validation calculations in (i)?
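

The construction in part (a) of Empirical Exercise E14.1 (primitive predictors, their squares, and all pairwise cross products) might be organized along the following lines. This is only a sketch of the bookkeeping: the variable names x1, ..., x20 and their values are placeholders rather than the variables in the data set, and the step that drops the square of the binary variable is left to the reader.

```python
from itertools import combinations


def expand_predictors(row):
    """Return primitives, squares, and pairwise cross products for one observation.

    `row` maps each primitive predictor name to its value.
    """
    expanded = dict(row)
    for name, value in row.items():
        expanded[name + "^2"] = value * value
    # combinations(..., 2) visits each unordered pair (j, k) with j before k once.
    for (name_j, x_j), (name_k, x_k) in combinations(row.items(), 2):
        expanded[name_j + "*" + name_k] = x_j * x_k
    return expanded


# Hypothetical observation with 20 primitive predictors named x1, ..., x20.
row = {"x" + str(j): float(j) for j in range(1, 21)}
expanded = expand_predictors(row)
print(len(expanded))
```

The count printed is 20 + 20 + (20 × 19)/2, which matches the number of predictors stated in part (a) before any are dropped.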
APPENDIX 14.1


The California School Test Score Data Set


The test scores used in this chapter are from the California Standards Tests (part of
California’s Standardized Testing and Reporting program) given to fifth-grade students in the spring
of 2013. The average test score for each of California’s schools is available from the California
Department of Education, where you can also find much of the other school and district data
used in the chapter. The remaining school and district data were obtained from ED-Data
(www.ed-data.org). All school and district data are for the 2012-13 academic year. In addition
to school and district data, demographic data for 2013 are constructed from the census tracts
making up the zip code for each school. These data are available from the American
Community Survey (see factfinder.census.gov). More detail is available in the replication files for
the chapter at http://www.pearsonglobaleditions.com.


APPENDIX 14.2


Derivation of Equation (14.4) for k = 1


With a single regressor, the OLS prediction in the standardized predictive regression model
(Equation (14.2)) for a given value X = x is $\hat{Y}(x) = \hat{\beta}x$. The second term in Equation (14.3)
is $E[(\hat{\beta} - \beta)X^{oos}]^2 = E(\hat{\beta} - \beta)^2 E(X^{oos})^2 = E(\hat{\beta} - \beta)^2$, where the first equality uses
the independence of $\hat{\beta}$ and $X^{oos}$ ($\hat{\beta}$ is estimated using the in-sample data) and the second
equality uses the fact that the regressors are standardized, so $E(X^{oos})^2 = var(X^{oos}) = 1$.
Because the OLS estimator is unbiased in the prediction model, $E(\hat{\beta} - \beta)^2 = var(\hat{\beta}) =
\sigma_u^2/(n\sigma_X^2) = \sigma_u^2/n$, where the second equality uses the large-n formula for the variance of the
OLS estimator under homoskedasticity in Equation (5.27) and the final equality uses the fact
that $\sigma_X^2 = 1$ because the regressors in Equation (14.2) are standardized using the population
mean and variance. It follows from Equation (14.3) that, with k = 1 under homoskedasticity,
the MSPE of OLS $= (1 + 1/n)\sigma_u^2$ for large n, which is Equation (14.4) with k = 1.


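APPENDIX 14.3


The Ridge Regression Estimator When k = 1


When k = 1, the ridge estimator minimizes the penalized sum of squares,
$S^{Ridge}(b; \lambda_{Ridge}) = \sum_{i=1}^{n}(Y_i - bX_i)^2 + \lambda_{Ridge}b^2$. Taking the derivative of
$S^{Ridge}(b; \lambda_{Ridge})$ with respect to b and setting the derivative equal to 0 yields
$-2\sum_{i=1}^{n}X_i(Y_i - \hat{\beta}^{Ridge}X_i) + 2\lambda_{Ridge}\hat{\beta}^{Ridge} = 0$. Solving for $\hat{\beta}^{Ridge}$ yields
$\hat{\beta}^{Ridge} = \sum_{i=1}^{n}X_iY_i/(\sum_{i=1}^{n}X_i^2 + \lambda_{Ridge}) = (1 + \lambda_{Ridge}/\sum_{i=1}^{n}X_i^2)^{-1}\hat{\beta}$, where
$\hat{\beta} = \sum_{i=1}^{n}X_iY_i/\sum_{i=1}^{n}X_i^2$ is the OLS estimator.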
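

The closed-form result above is easy to check numerically. The following sketch uses made-up data and hypothetical function names; it computes the OLS and ridge estimates for k = 1, confirms that the ridge estimate equals the OLS estimate multiplied by the shrinkage factor, and evaluates the first-order condition at the ridge estimate.

```python
def ols_k1(x, y):
    """OLS estimate with one regressor and no intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def ridge_k1(x, y, lam):
    """Ridge estimate with one regressor and no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def ridge_first_order_condition(x, y, lam, b):
    """Derivative of the penalized sum of squares with respect to b."""
    return -2.0 * sum(a * (c - b * a) for a, c in zip(x, y)) + 2.0 * lam * b


x = [1.0, 2.0]   # hypothetical data
y = [2.0, 3.0]
lam = 3.0        # hypothetical shrinkage parameter

beta_ols = ols_k1(x, y)
beta_ridge = ridge_k1(x, y, lam)
shrink = 1.0 / (1.0 + lam / sum(a * a for a in x))
print(beta_ols, beta_ridge, shrink * beta_ols)
print(ridge_first_order_condition(x, y, lam, beta_ridge))
```

The derivative is 0 at the ridge estimate, and the shrinkage factor lies between 0 and 1 for any positive value of the penalty weight.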
APPENDIX 14.4


The Lasso Estimator When k = 1


When k = 1, the Lasso minimizes the penalized sum of squared residuals,
$S^{Lasso}(b; \lambda_{Lasso}) = \sum_{i=1}^{n}(Y_i - bX_i)^2 + \lambda_{Lasso}|b|$. Inspection of Figure 14.3 shows that
$\hat{\beta}$ and $\hat{\beta}^{Lasso}$ must have the same sign when k = 1. Suppose $\hat{\beta}$ is positive. Then, over the
relevant range $b \geq 0$, the Lasso minimizes $\sum_{i=1}^{n}(Y_i - bX_i)^2 + \lambda_{Lasso}b$, and its derivative
with respect to b is $-2\sum_{i=1}^{n}X_i(Y_i - bX_i) + \lambda_{Lasso}$. For $\hat{\beta}^{Lasso} > 0$, setting this derivative
equal to 0 implies $-2\sum_{i=1}^{n}X_i(Y_i - \hat{\beta}^{Lasso}X_i) + \lambda_{Lasso} = 0$; otherwise, $\hat{\beta}^{Lasso} = 0$.
Solving for $\hat{\beta}^{Lasso}$ yields


$\hat{\beta}^{Lasso} = \max\left(\hat{\beta} - \tfrac{1}{2}\lambda_{Lasso}/\sum_{i=1}^{n}X_i^2,\ 0\right)$ when $\hat{\beta} \geq 0$. (14.11)


Similar reasoning shows that $\hat{\beta}^{Lasso} = \min\left(\hat{\beta} + \tfrac{1}{2}\lambda_{Lasso}/\sum_{i=1}^{n}X_i^2,\ 0\right)$ when $\hat{\beta} < 0$.
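

Equation (14.11) and its counterpart for a negative OLS estimate translate directly into code. The sketch below uses made-up data and a hypothetical function name; it also evaluates the penalized sum of squared residuals so that the estimate can be compared with nearby values of b.

```python
def lasso_k1(x, y, lam):
    """Lasso estimate with one regressor, following Equation (14.11)."""
    sxx = sum(a * a for a in x)
    beta_ols = sum(a * b for a, b in zip(x, y)) / sxx
    shift = 0.5 * lam / sxx
    if beta_ols >= 0:
        return max(beta_ols - shift, 0.0)
    return min(beta_ols + shift, 0.0)


def lasso_objective(x, y, lam, b):
    """Penalized sum of squared residuals evaluated at b."""
    return sum((c - b * a) ** 2 for a, c in zip(x, y)) + lam * abs(b)


x = [1.0, 2.0]   # hypothetical data
y = [2.0, 3.0]

moderate = lasso_k1(x, y, lam=6.0)     # shrunk toward 0 but not to 0
large = lasso_k1(x, y, lam=20.0)       # penalty large enough to give exactly 0
negative = lasso_k1(x, [-2.0, -3.0], lam=6.0)
print(moderate, large, negative)
```

Unlike ridge regression, which scales the OLS estimate, the Lasso subtracts a fixed amount and truncates at 0, which is why a sufficiently large penalty sets the estimate exactly to 0.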


APPENDIX 14.5


Computing Out-of-Sample Predictions
in the Standardized Regression Model


The estimators of this chapter are all computed using the standardized predictive regression model 
in Equation (14.2). Computing the prediction for an out-of-sample observation entails first stan- 
dardizing the out-of-sample predictors, then computing the demeaned out-of-sample prediction, 
then adding back in the in-sample mean of Y. These transformations must all be done using the 
same means, variances, and weights for the out-of-sample data as for the in-sample data. Details are
provided first for ridge regression and the Lasso, and then for principal components regression.


Out-of-Sample Predictions Using the Standardized 
Regression Model of Equation (14.2) (Ridge and Lasso) 


Following Section 14.2, let $X_1^{*oos}, \ldots, X_k^{*oos}$ denote an out-of-sample observation on the original,
untransformed values of the k predictors, and let $Y^{*oos}$ denote the out-of-sample observation on
the variable to be predicted. The transformed out-of-sample value of the jth predictor is
$X_j^{oos} = (X_j^{*oos} - \bar{X}_j^{*})/s_{X_j^{*}}$, where $\bar{X}_j^{*}$ and $s_{X_j^{*}}$ are the in-sample mean and standard deviation of
the jth predictor. Let $\tilde{\beta}_j$ be some estimator of $\beta_j$, e.g., the ridge regression or Lasso estimator. Then
the predicted value of the original dependent variable in terms of the original predictors is


$\hat{Y}^{*oos} = \bar{Y}^{*} + \sum_{j=1}^{k} \tilde{\beta}_j \left( \frac{X_j^{*oos} - \bar{X}_j^{*}}{s_{X_j^{*}}} \right)$, (14.12)


where $\bar{Y}^{*}$, $\bar{X}_j^{*}$, $s_{X_j^{*}}$, and $\tilde{\beta}_j$ (j = 1, ..., k) are all computed using the estimation sample.
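

Equation (14.12) amounts to three steps: standardize with in-sample moments, form the demeaned prediction, and add back the in-sample mean of Y. A minimal sketch follows; the function name and all numerical values are hypothetical, and the coefficients stand in for whatever estimator (ridge, Lasso, or another) was computed on the estimation sample.

```python
def predict_oos(x_star_oos, x_means, x_sds, y_mean, betas):
    """Prediction of the original (untransformed) Y, as in Equation (14.12)."""
    # Standardize the new observation using in-sample means and standard deviations.
    standardized = [(x - m) / s for x, m, s in zip(x_star_oos, x_means, x_sds)]
    # Prediction of demeaned Y from the standardized regression.
    demeaned_prediction = sum(b * z for b, z in zip(betas, standardized))
    # Add back the in-sample mean of Y.
    return y_mean + demeaned_prediction


# Hypothetical in-sample quantities for k = 2 predictors.
x_means = [10.0, 0.5]
x_sds = [2.0, 0.25]
y_mean = 100.0
betas = [3.0, -4.0]

prediction = predict_oos([12.0, 0.75], x_means, x_sds, y_mean, betas)
print(prediction)
```

An observation whose predictors all equal their in-sample means is predicted to have Y equal to the in-sample mean of Y, whatever the coefficients are.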
Out-of-Sample Predictions Using Principal
Components Regression


To compute the predicted value for an out-of-sample observation using principal components
regression, it is necessary, in addition, to compute the out-of-sample values of the principal
components using the in-sample weights. Let $\gamma$ denote the coefficients in the regression of Y
on the first p principal components:


$Y_i = \gamma_1 PC_{1i} + \gamma_2 PC_{2i} + \cdots + \gamma_p PC_{pi} + v_i$, (14.13)


where $v_i$ is an error term. The prediction of $Y^{*oos}$ is computed in the following steps:

1. Compute the principal components in the estimation sample:

a. Compute the demeaned Y and standardized X for the in-sample observations on
$Y^{*}$ and $X^{*}$ as described preceding Equation (14.2).

b. Compute the in-sample principal components of X; call these $PC_1, \ldots, PC_{\min(n,k)}$.

2. Given p, estimate the regression coefficients in Equation (14.13); call these estimates
$\hat{\gamma}_1^{PC}, \ldots, \hat{\gamma}_p^{PC}$.

3. Compute the out-of-sample values of the principal components:

a. Standardize the out-of-sample predictors $X^{*oos}$ using the in-sample mean and
standard deviation from step 1(a). Denote this transformed observation as $X^{oos}$.

b. Compute the principal components for the out-of-sample observation using the
in-sample weights from step 1(b); call these $PC_1^{oos}, \ldots, PC_p^{oos}$.

4. Compute the predicted value for the out-of-sample observation as
$\hat{Y}^{*oos} = \bar{Y}^{*} + \sum_{j=1}^{p} \hat{\gamma}_j^{PC} PC_j^{oos}$.
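
The four steps above can be sketched in code for the simplest case, p = 1. Several choices here are assumptions made to keep the example short and self-contained: the weights of the first principal component are found by power iteration on the cross-product matrix of the standardized regressors (a swapped-in numerical method, not one described in the text), only one component is used, and the data and function names are hypothetical.

```python
import math


def mean(v):
    return sum(v) / len(v)


def sample_sd(v):
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))


def first_pc_weights(z_cols, iterations=200):
    """Leading eigenvector of Z'Z by power iteration (weights for PC1 only)."""
    k = len(z_cols)
    gram = [[sum(a * b for a, b in zip(z_cols[i], z_cols[j])) for j in range(k)]
            for i in range(k)]
    w = [1.0 / (j + 1) for j in range(k)]   # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(gram[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(a * a for a in w))
        w = [a / norm for a in w]
    return w


def pcr_predict_one_component(x_cols, y, x_oos):
    # Step 1(a): demean Y and standardize X with in-sample moments.
    x_means = [mean(c) for c in x_cols]
    x_sds = [sample_sd(c) for c in x_cols]
    y_mean = mean(y)
    z_cols = [[(a - m) / s for a in c] for c, m, s in zip(x_cols, x_means, x_sds)]
    # Step 1(b): in-sample weights and values of the first principal component.
    w = first_pc_weights(z_cols)
    pc1 = [sum(w[j] * z_cols[j][i] for j in range(len(w))) for i in range(len(y))]
    # Step 2: regress demeaned Y on PC1 (no intercept).
    gamma = sum(p * (yi - y_mean) for p, yi in zip(pc1, y)) / sum(p * p for p in pc1)
    # Step 3: standardize the new observation, then apply the in-sample weights.
    z_oos = [(a - m) / s for a, m, s in zip(x_oos, x_means, x_sds)]
    pc1_oos = sum(wj * zj for wj, zj in zip(w, z_oos))
    # Step 4: add back the in-sample mean of Y.
    return y_mean + gamma * pc1_oos


# Hypothetical data: two perfectly correlated predictors, Y = 5 + 2 * x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0 * a + 3.0 for a in x1]
y = [5.0 + 2.0 * a for a in x1]
prediction = pcr_predict_one_component([x1, x2], y, x_oos=[10.0, 23.0])
print(round(prediction, 6))
```

In this artificial example the first principal component captures all of the variation in the two predictors, so the prediction reproduces the exact linear relationship. The key point is that the new observation is transformed with the in-sample means, standard deviations, and weights, never with quantities computed from the out-of-sample data.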
15 Introduction to Time Series Regression and Forecasting


Time series data (data collected for a single entity at multiple points in time) can be
used to answer quantitative questions for which cross-sectional data are inadequate. 
One such question is, what is the causal effect on a variable of interest, Y, of a change in 
another variable, X, over time? In other words, what is the dynamic causal effect on Y of 
a change in X? For example, what is the effect on traffic fatalities of a law requiring pas- 
sengers to wear seatbelts, both initially and subsequently, as drivers adjust to the law? 
Another such question is, what is your best forecast of the value of some variable at a 
future date? For example, what is your best forecast of next month’s unemployment rate, 
interest rates, or stock prices? Both of these questions—one about dynamic causal 
effects, the other about economic forecasting—can be answered using time series data. 

This chapter and Chapters 16 and 17 introduce techniques for econometric analy- 
sis of time series data and apply those techniques to the problems of forecasting and 
estimating dynamic causal effects. This chapter introduces the basic concepts and 
tools of regression using time series data and applies them to economic forecasting. 
Chapter 16 applies these tools to the estimation of dynamic causal effects. Chapter 17 
takes up some more advanced topics in time series econometrics, including forecast- 
ing multiple time series, forecasting with many predictors, and modeling changes in 
volatility over time. 

Economic forecasting is the prediction of future values of economic variables. 
Firms use economic forecasts when they plan production levels. Governments use rev- 
enue forecasts when they develop their budgets for the upcoming year. Economists at 
central banks, like the U.S. Federal Reserve System, forecast economic variables includ- 
ing the inflation rate and the growth of Gross Domestic Product (GDP) as part of set- 
ting monetary policy. Wall Street investors rely on forecasts of profits when deciding 
whether to invest in a company. 

Forecasting is an application of the more general prediction problem in statistics, 
in which a given set of data is used to predict observations not in the data set. Fore- 
casting refers to the prediction of future values of time series data. As with prediction 
more generally, forecasting models need not and generally do not have a causal 
interpretation. 

Section 15.1 presents some examples of economic time series data and introduces 
basic concepts of time series analysis. Section 15.2 sets out the forecasting problem 
and introduces a measure of forecast accuracy, the mean squared forecast error. It 
also introduces the concept of stationarity, which implies that historical relationships 
among variables hold in the future, so that past data can reliably be used to make 
forecasts. Section 15.3 introduces autoregressions, time series regression models in
which the regressors are past values of the dependent variable, and Section 15.4
explains how to include additional regressors. For example, we find that including the 
term spread (the difference between long- and short-term interest rates) improves 
forecasts of the growth of U.S. GDP relative to using only lagged values of GDP growth. 
Section 15.5 discusses how to estimate the mean squared forecast error and how to 
compute forecast intervals—that is, ranges that are likely to contain the actual value of 
the variable being forecasted. Section 15.6 describes methods for choosing the num- 
ber of lags in forecasting models. Sections 15.7 and 15.8 take up two common depar- 
tures from the assumption of stationarity, trends and breaks, and show how to modify 
forecasting regressions if they are present. 


Introduction to Time Series Data and 
Serial Correlation 


A good place to start any empirical analysis is plotting the data, so that is where we begin. 


Real GDP in the United States 


Gross Domestic Product (GDP) measures the value of goods and services produced 
in an economy over a given time period. Figure 15.1a plots the value of “real” GDP 
per year in the United States from 1960 through 2017, where “real” indicates that the 
values have been adjusted for inflation. The values of GDP are expressed in $2009, 
which means that the price level is held fixed at its 2009 value. Because U.S. GDP 
grows at approximately an exponential rate, Figure 15.1a plots GDP on a logarithmic 
scale. GDP increased dramatically over a recent 58-year period, from approximately 
$3 trillion in 1960 to over $17 trillion in 2017. Measured on a logarithmic scale, this 
greater-than-five-fold increase corresponds to an increase of 1.7 log points. The rate 
of growth was not constant, however, and the figure shows declines in GDP during 
the recessions of 1960-1961, 1970, 1974-1975, 1980, 1981-1982, 1990-1991, 2001, and 
2007-2009, episodes denoted by shading in Figure 15.1. 
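
As a quick check of the log-point figure, using the rounded values of $3 trillion and $17 trillion quoted above (so the result is itself approximate):

```latex
\ln(17) - \ln(3) = \ln\!\left(\tfrac{17}{3}\right) \approx \ln(5.67) \approx 1.73
```

This is consistent with an increase of about 1.7 log points for a slightly greater-than-five-fold rise in the level.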


Lags, First Differences, Logarithms, and Growth Rates 


The observation on the time series variable Y made at date t is denoted $Y_t$, and the
total number of observations is denoted T. The interval between observations (that
is, the period of time between observation t and observation t + 1) is some unit of
time such as weeks, months, quarters (three-month units), or years. A set of T
observations on a time series variable Y is denoted $Y_1, \ldots, Y_T$, or $\{Y_t\}$, $t = 1, \ldots, T$. This
notation parallels the notation for cross-sectional data, in which the observations are
denoted by $i = 1, \ldots, n$. In a given data set, the date t = 1 corresponds to the first
date in the data set, and t = T corresponds to the final date in the data set. For
example, the GDP data studied in this chapter are quarterly, so the unit of time
(a period) is a quarter of a year. The data plotted in Figure 15.1b are quarterly growth
rates of GDP from the first quarter of 1960, or 1960:Q1, through the fourth quarter
of 2017, or 2017:Q4, for a total of T = 232 observations.

FIGURE 15.1 The Logarithm and the Growth Rate of Real GDP in the United States, 1960-2017

The change in the value of Y between period t − 1 and period t is $Y_t - Y_{t-1}$; this
change is called the first difference in the variable $Y_t$. In time series data, “Δ” is used
to represent the first difference, so $\Delta Y_t = Y_t - Y_{t-1}$.

Special terminology and notation are used to indicate future and past values
of Y. The value of Y in the previous period (relative to the current period, t) is called
its first lagged value (or, more simply, its first lag) and is denoted $Y_{t-1}$. Its jth lagged
value (or, more simply, its jth lag) is its value j periods ago, which is $Y_{t-j}$. Similarly, $Y_{t+1}$
denotes the value of Y one period into the future.

Economic time series are often analyzed after computing their logarithms or the
changes in their logarithms. One reason for this is that many economic series exhibit
growth that is approximately exponential; that is, over the long run, the series tends
to grow by a certain percentage per year on average. This implies that the logarithm
of the series grows approximately linearly and is why Figure 15.1a plots the logarithm
of U.S. GDP. Another reason is that the standard deviation of many economic time
series is approximately proportional to its level; that is, the standard deviation is well
expressed as a percentage of the level of the series. This implies that the standard
deviation of the logarithm of the series is approximately constant. In either case, it is
useful to transform the series so that changes in the transformed series are proportional
(or percentage) changes in the original series, and this is achieved by taking
the logarithm of the series.¹

KEY CONCEPT 15.1: Lags, First Differences, Logarithms, and Growth Rates

• The first lag of a time series \(Y_t\) is \(Y_{t-1}\); its \(j\)th lag is \(Y_{t-j}\).

• The first difference of a series, \(\Delta Y_t\), is its change between periods \(t - 1\) and
\(t\); that is, \(\Delta Y_t = Y_t - Y_{t-1}\).

• The first difference of the logarithm of \(Y_t\) is \(\Delta\ln(Y_t) = \ln(Y_t) - \ln(Y_{t-1})\).

• The percentage change of a time series \(Y_t\) between periods \(t - 1\) and \(t\)
is approximately \(100\,\Delta\ln(Y_t)\), where the approximation is most accurate
when the percentage change is small.

Lags, first differences, and growth rates are summarized in Key Concept 15.1. 

Lags, changes, and percentage changes are illustrated using the U.S. GDP data in
Table 15.1. The first column shows the date, or period, where the fourth quarter of
2016 is denoted 2016:Q4, the first quarter of 2017 is denoted 2017:Q1, and so forth.
The second column shows the value of GDP in that quarter, the third column shows
the logarithm of GDP, and the fourth column shows the growth rate of GDP (in
percent at an annual rate). For example, from the fourth quarter of 2016 to the first
quarter of 2017, GDP increased from $16,851 billion to $16,903 billion, which is a percentage
increase of 100 × (16,903 - 16,851)/16,851 = 0.31%. This is the percentage
increase from one quarter to the next. It is conventional to report rates of growth in
quarterly macroeconomic time series on an annual basis, which is the percentage
increase in GDP that would occur over a year if the series were to continue to
increase at the same rate. Because there are four quarters in a year, the annualized
rate of GDP growth in 2017:Q1 is 0.31 × 4 = 1.24, or 1.24%.


¹The change of the logarithm of a variable is approximately equal to the proportional change of that
variable; that is, \(\ln(X + a) - \ln(X) \cong a/X\), where the approximation works best when \(a/X\) is small [see
Equation (8.16) and the surrounding discussion]. Now, replace \(X\) with \(Y_{t-1}\) and \(a\) with \(\Delta Y_t\), and note that
\(Y_t = Y_{t-1} + \Delta Y_t\). This means that the proportional change in the series \(Y_t\) between periods \(t - 1\) and \(t\) is
approximately \(\ln(Y_t) - \ln(Y_{t-1}) = \ln(Y_{t-1} + \Delta Y_t) - \ln(Y_{t-1}) \cong \Delta Y_t / Y_{t-1}\) [see Equation (8.16)]. The
expression \(\ln(Y_t) - \ln(Y_{t-1})\) is the first difference of \(\ln(Y_t)\); that is, \(\Delta\ln(Y_t)\). Thus \(\Delta\ln(Y_t) \cong \Delta Y_t / Y_{t-1}\).
The percentage change is 100 times the fractional change, so the percentage change in the series \(Y_t\) is
approximately \(100\,\Delta\ln(Y_t)\).

TABLE 15.1 GDP in the United States in the Last Quarter of 2016 and in 2017

Quarter    U.S. GDP (billions    Logarithm of GDP,    Growth Rate of GDP at an Annual Rate,    First Lag,
           of $2009), GDP_t      ln(GDP_t)            GDPGR_t = 400 × Δln(GDP_t)               GDPGR_{t-1}
2016:Q4    16,851                9.732                1.74                                     2.74
2017:Q1    16,903                9.735                1.23                                     1.74
2017:Q2    17,031                9.743                3.01                                     1.23
2017:Q3    17,164                9.751                3.11                                     3.01
2017:Q4    17,272                9.757                2.50                                     3.11

Note: The quarterly rate of GDP growth is the first difference of the logarithm. This is converted into percentages at an annual rate by
multiplying by 400. The first lag is its value in the previous quarter. All entries are rounded to the nearest decimal.


In the table, this percentage change is computed using the differences-of-logarithms
approximation in Key Concept 15.1. The difference in the logarithm of
GDP from 2016:Q4 to 2017:Q1 is ln(16,903) - ln(16,851) = 0.00308, yielding the
approximate quarterly percentage difference 100 × 0.00308 = 0.308%. On an annualized
basis, this is 0.308 × 4 = 1.23, or 1.23%, essentially the same as the change
obtained by directly computing the percentage growth. These calculations can be
summarized as

\[
\text{Annualized rate of GDP growth} = GDPGR_t = 400[\ln(GDP_t) - \ln(GDP_{t-1})] = 400\,\Delta\ln(GDP_t), \tag{15.1}
\]

where \(GDP_t\) is the value of GDP at date \(t\). The factor of 400 arises from converting
the decimal change to a percentage (multiplying by 100) and then converting the
quarterly percentage change to an equivalent annual rate (multiplying by 4).

The final column of Table 15.1 illustrates lags. The first lag of GDPGR in 2017:Q1 
is 1.74%, the value of GDPGR in 2016:Q4. 
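
The calculations behind Table 15.1 can be sketched in a few lines of code. The example below is only an illustration: it is written in Python (a choice made here, not one the text prescribes), it starts from the rounded GDP levels printed in Table 15.1, and the helper names `lag`, `first_difference`, and `annualized_log_growth` are hypothetical. Because the inputs are rounded to the nearest billion dollars, the computed growth rates can differ from the table in the second decimal place.

```python
import math

# Rounded GDP levels from Table 15.1 (billions of 2009 dollars), 2016:Q4 to 2017:Q4.
quarters = ["2016:Q4", "2017:Q1", "2017:Q2", "2017:Q3", "2017:Q4"]
gdp = [16851.0, 16903.0, 17031.0, 17164.0, 17272.0]


def lag(series, j=1):
    """Return the j-th lag: position t holds series[t - j], or None when it is unavailable."""
    if j < 1:
        raise ValueError("j must be a positive integer")
    return [None] * j + series[:-j]


def first_difference(series):
    """Return Delta Y_t = Y_t - Y_{t-1}; the first observation has no predecessor."""
    return [None] + [series[t] - series[t - 1] for t in range(1, len(series))]


def annualized_log_growth(series, periods_per_year=4):
    """Return 100 * periods_per_year * Delta ln(Y_t), the form of Equation (15.1)."""
    dlogs = first_difference([math.log(v) for v in series])
    return [None if d is None else 100 * periods_per_year * d for d in dlogs]


gdpgr = annualized_log_growth(gdp)   # growth rate, percent at an annual rate
gdpgr_lag1 = lag(gdpgr, 1)           # first lag of the growth rate

# Exact quarterly percentage change versus the log approximation for 2017:Q1.
exact_pct = 100 * (gdp[1] - gdp[0]) / gdp[0]
approx_pct = 100 * (math.log(gdp[1]) - math.log(gdp[0]))

for q, g, g1 in zip(quarters, gdpgr, gdpgr_lag1):
    print(q, g, g1)
print("exact:", exact_pct, "log approximation:", approx_pct)
```

The growth rate for 2016:Q4 comes out as `None` here because the sketch has no 2016:Q3 level to difference against; the table's value for that quarter relies on data outside the five rows shown.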

Figure 15.1b plots \(GDPGR_t\) from 1960:Q1 through 2017:Q4. It shows substantial
variability in the growth rate of GDP. For example, GDP grew at an annual rate of
over 15% in 1978:Q2 and fell at an annual rate of over 8% in 2008:Q4. Over the
entire period, the growth rate averaged 3.0% (which is responsible for the increase
of GDP from $3.1 trillion in 1960 to $17.3 trillion in 2017), and the sample standard
deviation was 3.3%.


Autocorrelation 


In time series data, the value of \(Y\) in one period typically is correlated with its value
in the next period. The correlation of a series with its own lagged values is called
autocorrelation or serial correlation. The first autocorrelation (or autocorrelation
coefficient) is the correlation between \(Y_t\) and \(Y_{t-1}\); that is, the correlation between
values of \(Y\) at two adjacent dates. The second autocorrelation is the correlation
between \(Y_t\) and \(Y_{t-2}\), and the \(j\)th autocorrelation is the correlation between \(Y_t\) and
\(Y_{t-j}\). Similarly, the \(j\)th autocovariance is the covariance between \(Y_t\) and \(Y_{t-j}\).
Autocorrelation and autocovariance are summarized in Key Concept 15.2.

KEY CONCEPT 15.2: Autocorrelation (Serial Correlation) and Autocovariance

The \(j\)th autocovariance of a series \(Y_t\) is the covariance between \(Y_t\) and its \(j\)th lag,
\(Y_{t-j}\), and the \(j\)th autocorrelation coefficient is the correlation between \(Y_t\) and \(Y_{t-j}\).
That is,

\[
j\text{th autocovariance} = \mathrm{cov}(Y_t, Y_{t-j}) \tag{15.2}
\]

\[
j\text{th autocorrelation} = \rho_j = \mathrm{corr}(Y_t, Y_{t-j}) = \frac{\mathrm{cov}(Y_t, Y_{t-j})}{\sqrt{\mathrm{var}(Y_t)\,\mathrm{var}(Y_{t-j})}}. \tag{15.3}
\]

The \(j\)th autocorrelation coefficient is sometimes called the \(j\)th serial correlation
coefficient.

The \(j\)th population autocovariances and autocorrelations in Key Concept 15.2
can be estimated by the \(j\)th sample autocovariances and autocorrelations,

\[
\widehat{\mathrm{cov}}(Y_t, Y_{t-j}) = \frac{1}{T} \sum_{t=j+1}^{T} (Y_t - \bar{Y}_{j+1:T})(Y_{t-j} - \bar{Y}_{1:T-j}) \tag{15.4}
\]

\[
\hat{\rho}_j = \frac{\widehat{\mathrm{cov}}(Y_t, Y_{t-j})}{\widehat{\mathrm{var}}(Y_t)}, \tag{15.5}
\]

where \(\bar{Y}_{j+1:T}\) denotes the sample average of \(Y_t\) computed using the observations
\(t = j + 1, \ldots, T\) (similarly, \(\bar{Y}_{1:T-j}\) uses the observations \(t = 1, \ldots, T - j\)) and where
\(\widehat{\mathrm{var}}(Y_t)\) is the sample variance of \(Y\).²

The first four sample autocorrelations of GDPGR, the growth rate of GDP, are
\(\hat{\rho}_1 = 0.33\), \(\hat{\rho}_2 = 0.26\), \(\hat{\rho}_3 = 0.10\), and \(\hat{\rho}_4 = 0.11\). These values suggest that GDP
growth rates are mildly positively autocorrelated: If GDP grows faster than average
in one period, it tends to also grow faster than average in the following period.
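
As a rough illustration of Equations (15.4) and (15.5), the sample autocorrelation can be computed directly from a list of observations. The sketch below is written in Python with a hypothetical function name, `sample_autocorrelation`. One detail is an assumption rather than something the text states: the sample variance in the denominator is computed with the usual \(T - 1\) divisor; other conventions divide by \(T\).

```python
def sample_autocorrelation(y, j):
    """Sample j-th autocorrelation: Equation (15.4) divided by the sample variance.

    y -- list of observations Y_1, ..., Y_T in time order
    j -- lag, an integer with 0 < j < T
    """
    T = len(y)
    if not 0 < j < T:
        raise ValueError("need 0 < j < T")

    mean_late = sum(y[j:]) / (T - j)        # average over t = j+1, ..., T
    mean_early = sum(y[:T - j]) / (T - j)   # average over t = 1, ..., T-j

    # Equation (15.4): the sum of cross products is divided by T, not T - j.
    autocov = sum(
        (y[t] - mean_late) * (y[t - j] - mean_early) for t in range(j, T)
    ) / T

    # Assumption: the sample variance uses the T - 1 divisor.
    mean_all = sum(y) / T
    variance = sum((v - mean_all) ** 2 for v in y) / (T - 1)

    return autocov / variance


trend = [1.0, 2.0, 3.0, 4.0, 5.0]                  # steadily rising series
zigzag = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]         # alternating series
print(sample_autocorrelation(trend, 1))
print(sample_autocorrelation(zigzag, 1))
```

A steadily rising series gives a positive first autocorrelation, and an alternating series gives a negative one, which matches the interpretation of the sign in the text.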


²The summation in Equation (15.4) is divided by \(T\), whereas in the usual formula for the sample covariance
[see Equation (3.24)], the summation is divided by the number of observations in the summation
minus a degrees-of-freedom adjustment. The formula in Equation (15.4) is conventional for the purpose
of computing the autocovariance. Equation (15.5) uses the assumption that \(\mathrm{var}(Y_t)\) and \(\mathrm{var}(Y_{t-j})\) are the
same, an implication of the assumption that \(Y\) is stationary, a concept introduced in Section 15.3.

FIGURE 15.2 Four Economic Time Series
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The four time series have markedly different patterns. The unemployment rate (Figure 15.2a) increases during recessions 
and declines during expansions. The exchange rate between the U.S. dollar and the British pound (Figure 15.2b) shows 
a discrete change after the 1972 collapse of the Bretton Woods system of fixed exchange rates. The logarithm of the 
Japan Index of Industrial Production (Figure 15.2c) shows decreasing growth. The daily percentage changes in the 
Wilshire 5000 Total Market Index, a stock price index (Figure 15.2d), are essentially unpredictable, but the variance 


Other Examples of Economic Time Series 


Economic time series differ greatly. Four examples of economic time series are plot- 
ted in Figure 15.2: the U.S. unemployment rate; the rate of exchange between the U.S. 
dollar and the British pound; the logarithm of the Japan Index of Industrial Produc- 
tion; and the percentage change in daily values of the Wilshire 5000 Total Market 
Index, a stock price index. 

The U.S. unemployment rate (Figure 15.2a) is the fraction of the labor force 
out of work, as measured in the Current Population Survey (see Appendix 3.1). 
Figure 15.2a shows that the unemployment rate increases by large amounts during 
recessions (the shaded areas in Figure 15.1) and falls during expansions. 

The dollar/pound exchange rate (Figure 15.2b) is the price of a British pound (£)
in U.S. dollars. Before 1972, the developed economies ran a system of fixed exchange
rates, called the Bretton Woods system, under which governments kept exchange
rates from fluctuating. In 1972, inflationary pressures led to the breakdown of this
system; thereafter, the major currencies were allowed to "float"; that is, their values
were determined by the supply and demand for currencies in the market for foreign
exchange. Prior to 1972, the exchange rate was approximately constant, with the
exception of a single devaluation in 1968, in which the official value of the pound
relative to the dollar was decreased to $2.40. Since 1972, the exchange rate has fluctuated
over a very wide range.

The Japan Index of Industrial Production (Figure 15.2c) measures Japan’s output 
of industrial commodities. The logarithm of the series is plotted in Figure 15.2c, and 
changes in this series can be interpreted as (fractional) growth rates. During the 1960s 
and early 1970s, Japanese industrial production grew quickly, but this growth slowed in 
the late 1970s and 1980s, and industrial production has grown little since the early 1990s. 

The Wilshire 5000 Total Market Index is an index of the share prices of all 
firms traded on exchanges in the United States. Figure 15.2d plots the daily per- 
centage change in this index for trading days from January 2, 1990, to December 
29, 2017 (a total of 7305 observations). Unlike the other series in Figure 15.2, there 
is very little serial correlation in these daily percentage changes; if there were, then 
you could predict them using past daily changes and make money by buying when 
you expect the market to rise and selling when you expect it to fall. Although the 
changes are essentially unpredictable, inspection of Figure 15.2d reveals patterns 
in their volatility. For example, the standard deviation of daily percentage changes 
was relatively large in 1998-2003 and 2007-2012, and it was relatively small in 1994,
2004, and 2017. This volatility clustering is found in many financial time series, and
econometric models for this special type of heteroskedasticity are taken
up in Section 17.5.


Stationarity and the Mean Squared 
Forecast Error 


Stationarity 


Time series forecasts use data on the past to forecast the future. Doing so presumes 
that the future is similar to the past in the sense that the correlations, and more gener- 
ally the distributions, of the data in the future will be like they were in the past. If the 
future differs fundamentally from the past, then historical relationships might not be 
reliable guides to the future. 

In the context of regression with time series data, the idea that historical rela- 
tionships can be generalized to the future is formalized by the concept of stationarity. 
The precise definition of stationarity, given in Key Concept 15.3, is that the probabil- 
ity distribution of the time series variable does not change over time. Under the 
assumption of stationarity, regression models estimated using past data can be used 
to forecast future values.

KEY CONCEPT 15.3: Stationarity

A time series \(Y_t\) is stationary if its probability distribution does not change over
time; that is, if the joint distribution of \((Y_{s+1}, Y_{s+2}, \ldots, Y_{s+T})\) does not depend
on \(s\), regardless of the value of \(T\); otherwise, \(Y_t\) is said to be nonstationary. A pair
of time series, \(X_t\) and \(Y_t\), are said to be jointly stationary if the joint distribution
of \((X_{s+1}, Y_{s+1}, X_{s+2}, Y_{s+2}, \ldots, X_{s+T}, Y_{s+T})\) does not depend on \(s\), regardless of
the value of \(T\). Stationarity requires the future to be like the past, at least in a
probabilistic sense.


Stationarity can fail to hold for multiple reasons, in which case the time series is 
said to be nonstationary. One reason is that the unconditional mean might have a 
trend. For example, the logarithm of U.S. GDP plotted in Figure 15.1a has a persistent 
upward trend, reflecting long-term economic growth. Another type of nonstationar- 
ity arises when the population regression coefficients change at a given point in time. 
Ways to detect and to address these two types of nonstationarity are taken up in 
Sections 15.6 and 15.7. Until then, we assume that the time series is stationary.


Forecasts and Forecast Errors 


This chapter considers the problem of forecasting the value of a time series variable
\(Y\) in the period immediately following the end of the available data; that is,
of forecasting \(Y_{T+1}\) using data through date \(T\). This forecast answers questions
such as, Given data through the current quarter, what is my forecast of GDP 
growth for the next quarter? Because the forecast is for the next time period, this 
forecast is called a one-step ahead forecast. A more ambitious question is, Given 
data through the current quarter, what is my forecast of GDP growth for each of 
the next eight quarters? Answering that question entails making a forecast over a 
longer horizon, called a multi-step ahead forecast. Multi-step ahead forecasts are 
taken up in Chapter 17. 

We let \(\hat{Y}_{T+1|T}\) denote a candidate one-step ahead forecast of \(Y_{T+1}\). In this notation,
the subscript \(T + 1|T\) indicates that the forecast is of the value of \(Y\) at time \(T + 1\), made
using data through time \(T\), and the caret (^) indicates that the forecast is based on an
estimated model. For example, suppose you have quarterly observations on GDP
growth (\(Y\)) from 1960:Q1 to 2017:Q3. The one-step ahead forecasting problem is to use
these data to forecast GDP growth in 2017:Q4, and the forecast is denoted \(\hat{Y}_{2017:Q4|2017:Q3}\).

Because the future is unknown, errors in forecasting are inevitable. The forecast
error is the difference between the actual value of \(Y_{T+1}\) and its forecast:

\[
\text{Forecast error} = Y_{T+1} - \hat{Y}_{T+1|T}. \tag{15.6}
\]

A forecast refers to a prediction made for a future date that is not in the data set
used to make the forecast; that is, the forecast is for an out-of-sample future observation.
The forecast error is the mistake made by the forecast, which is realized only
after time has elapsed and the actual value of \(Y_{T+1}\) is observed.


The Mean Squared Forecast Error 


Because forecast errors are inevitable, the aim of the forecaster is not to eliminate 
errors but rather to make them as small as possible — that is, to make the forecasts as 
accurate as possible. To make this goal precise, we need a quantitative measure of 
what it means for a forecast error to be small. The most commonly used measure, 
which we adopt in this text, is the mean squared forecast error (MSFE), which is the 
expected value of the square of the forecast error: 


\[
\text{MSFE} = E[(Y_{T+1} - \hat{Y}_{T+1|T})^2]. \tag{15.7}
\]


The MSFE is the time series counterpart of the mean squared prediction error intro- 
duced in Section 14.2 for out-of-sample prediction with cross-sectional data. 

In practice, large forecast errors can be much more costly than small ones. A 
series of small forecast errors often causes only minor problems for the user, but a 
single very large forecast error can call the entire forecasting activity into question. 
The MSFE captures this idea by using the square of the forecast error, so that large 
errors receive a much greater penalty than small ones. 

The root mean squared forecast error (RMSFE) is the square root of the MSFE.
The RMSFE is easily interpreted because it has the same units as \(Y\). If the forecast
is unbiased, forecast errors have mean zero and the RMSFE is the standard deviation
of the out-of-sample forecast error made using a given model.

The MSFE incorporates two sources of randomness. The first is the randomness
of the future value, \(Y_{T+1}\). The second is the randomness arising from estimating a
forecasting model. For example, suppose a forecaster uses a very simple model, in
which the value of \(Y_{T+1}\) is forecasted to be its historical mean value, \(\mu_Y\). (This simple
model is a plausible starting point for forecasting stock returns, as discussed in
the box "Can You Beat the Market?" later in this section.) Because the mean is
unknown, it must be estimated, say, by \(\hat{\mu}_Y\). In this example, the forecast
is \(\hat{Y}_{T+1|T} = \hat{\mu}_Y\), the forecast error is \(Y_{T+1} - \hat{Y}_{T+1|T} = Y_{T+1} - \hat{\mu}_Y\), and the MSFE is
\(\text{MSFE} = E[(Y_{T+1} - \hat{\mu}_Y)^2]\). By adding and subtracting \(\mu_Y\), if \(Y_{T+1}\) is uncorrelated
with \(\hat{\mu}_Y\), the MSFE can be written as \(\text{MSFE} = E[(Y_{T+1} - \mu_Y)^2] + E[(\hat{\mu}_Y - \mu_Y)^2]\).
The first term in this expression is the error the forecaster would make if the population
mean were known: This term captures the random future (out-of-sample) fluctuations in
\(Y_{T+1}\) around the population mean. The second term in this expression is the additional
error made because the population mean is unknown, so the forecaster must estimate it.
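
The add-and-subtract step can be written out explicitly. The following is a sketch of the algebra in the text's notation, assuming (as stated above) that \(Y_{T+1}\) is uncorrelated with \(\hat{\mu}_Y\) and that \(E(Y_{T+1}) = \mu_Y\), so that the cross term is a covariance:

```latex
\begin{align*}
\text{MSFE} &= E\big[(Y_{T+1} - \hat{\mu}_Y)^2\big] \\
            &= E\big[\{(Y_{T+1} - \mu_Y) - (\hat{\mu}_Y - \mu_Y)\}^2\big] \\
            &= E\big[(Y_{T+1} - \mu_Y)^2\big]
               - 2\,E\big[(Y_{T+1} - \mu_Y)(\hat{\mu}_Y - \mu_Y)\big]
               + E\big[(\hat{\mu}_Y - \mu_Y)^2\big], \\
E\big[(Y_{T+1} - \mu_Y)(\hat{\mu}_Y - \mu_Y)\big]
            &= \mathrm{cov}(Y_{T+1}, \hat{\mu}_Y) = 0, \\
\text{so}\quad \text{MSFE}
            &= E\big[(Y_{T+1} - \mu_Y)^2\big] + E\big[(\hat{\mu}_Y - \mu_Y)^2\big].
\end{align*}
```

The cross term vanishes only because of the uncorrelatedness assumption; without it, the two-term decomposition would carry an extra covariance term.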

From the perspective of the MSFE, the best-possible prediction is the conditional
mean given the in-sample observations on \(Y\); that is, \(E(Y_{T+1} | Y_1, \ldots, Y_T)\)
(Appendix 2.2). This best-possible forecast, \(E(Y_{T+1} | Y_1, \ldots, Y_T)\), is called the oracle
forecast. The oracle forecast is infeasible because the conditional mean is unknown
in practice. Because it minimizes the MSFE, the oracle forecast is a conceptual
benchmark against which to assess an actual forecast.

The MSFE is an unknown population expectation, so to use it in practice it must
be estimated using data. We discuss estimation of the RMSFE in Section 15.4.


Can You Beat the Market?

Have you ever dreamed of getting rich quickly by beating the stock market? If you think that
the market will be going up, you should buy stocks today and sell them later, before the market
turns down. If you are good at forecasting swings in stock prices, then this active trading
strategy will produce better returns than a passive "buy and hold" strategy, in which you
purchase stocks and just hang onto them. The trick, of course, is having a reliable forecast
of future stock returns.

Forecasts based on past values of stock returns are sometimes called momentum forecasts: If the
value of a stock rose this month, perhaps it has momentum and will also rise next month. If so,
then returns will be autocorrelated, and the autoregressive model will provide useful forecasts.
You can implement a momentum-based strategy for a specific stock or for a stock index that
measures the overall value of the market.

Table 15.2 presents autoregressive models of the excess return on a broad-based index of stock
prices, called the CRSP value-weighted index, using monthly data from 1960:M1 to 2002:M12, where
M1 denotes the first month of the year (January), M2 denotes the second month, and so forth. The
monthly excess return is what you earn, in percentage terms, by purchasing a stock at the end of
the previous month and selling it at the end of this month minus what you would have earned had
you purchased a safe asset (a U.S. Treasury bill). The return on the stock includes the capital
gain (or loss) from the change in price plus any dividends you receive during the month. The data
are described further in Appendix 15.1.

Sadly, the results in Table 15.2 are negative. The coefficient on lagged returns in the AR(1)
model is not statistically significant, and we cannot reject the null hypothesis that the
coefficients on lagged returns are all 0 in the AR(2) or AR(4) model. In fact, the adjusted
\(R^2\), or \(\bar{R}^2\), of one of the models is negative, and those of the other two are only
slightly positive, suggesting that none of these models is useful for forecasting.

These negative results are consistent with the theory of efficient capital markets, which holds
that excess returns should be unpredictable because stock prices already embody all currently
available information. The reasoning is simple: If market participants think that a stock will
have a positive excess return next month, then they will buy that stock now, but doing so will
drive up the price of the stock to exactly the point at which there is no expected excess return.
As a result, we should not be able to forecast future excess returns by using past publicly
available information, and we cannot do it, at least using the regressions in Table 15.2.

TABLE 15.2 Autoregressive Models of Monthly Excess Stock Returns, 1960:M1-2002:M12

Dependent variable: excess returns on the CRSP value-weighted index

                                   (1)         (2)         (3)
Specification                      AR(1)       AR(2)       AR(4)
Regressors
excess return_{t-1}                0.050       0.053       0.054
                                   (0.051)     (0.051)     (0.051)
excess return_{t-2}                            -0.053      0.054
                                               (0.048)     (0.048)
excess return_{t-3}                                        0.009
                                                           (0.050)
excess return_{t-4}                                        -0.016
                                                           (0.047)
Intercept                          0.312       0.328       0.331
                                   (0.197)     (0.199)     (0.202)
F-statistic for coefficients       0.968       1.342       0.707
on lags of excess return           (0.325)     (0.261)     (0.587)
(p-value)
Adjusted R^2                       0.0006      0.0014      -0.0022

Note: Excess returns are measured in percentage points per month. The data are described in Appendix 15.1. All regressions are estimated
over 1960:M1-2002:M12 (T = 516 observations), with earlier observations used for initial values of lagged variables. Entries in the
regressor rows are coefficients, with standard errors in parentheses. The final two rows report the F-statistic testing the hypothesis that
the coefficients on lags of excess return in the regression are 0, with its p-value in parentheses, and the adjusted \(R^2\), or \(\bar{R}^2\).


15.3 Autoregressions


If you want to predict the future, a good place to start is the immediate past. For 
example, if you want to forecast the rate of GDP growth in the next quarter, you 
might use data on how fast GDP grew in the current quarter or perhaps over the past 
several quarters as well. To do so, a forecaster would fit an autoregression. 


The First-Order Autoregressive Model 


An autoregression expresses the conditional mean of a time series variable \(Y_t\) as a
linear function of its own lagged values. A first-order autoregression uses only one
lag of \(Y\) in this conditional expectation. That is, in a first-order autoregression,
\(E(Y_t | Y_{t-1}, Y_{t-2}, \ldots) = \beta_0 + \beta_1 Y_{t-1}\). The first-order autoregression [AR(1)] model
can be written in the familiar form of a regression model as

\[
Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t, \tag{15.8}
\]

where \(u_t\) is the error term. The first-order autoregression in Equation (15.8) is a population
autoregression with two unknown coefficients, \(\beta_0\) and \(\beta_1\).




The unknown population coefficients \(\beta_0\) and \(\beta_1\) in Equation (15.8) can be
estimated by ordinary least squares (OLS). How to estimate \(\beta_0\) and \(\beta_1\) might initially
seem puzzling: Unlike a cross-sectional regression with \(X\) on the right-hand
side, Equation (15.8) has \(Y\) on both the right- and the left-hand sides! The solution
to this puzzle is to realize that the variable \(Y_{t-1}\) on the right-hand side differs
from the dependent variable \(Y_t\) because the regressor is the first lag of \(Y\). That is,
Equation (15.8) has the form of a standard regression model, with \(X\) being the
first lag of \(Y\). Thus, to estimate \(\beta_0\) and \(\beta_1\), you must create a new variable (the
first lag of \(Y\)) and then use that as the regressor. Doing so yields the OLS estimators,
\(\hat{\beta}_0\) and \(\hat{\beta}_1\).

To make this concrete, consider estimating a first-order autoregression for GDP
growth. Observations on the dependent variable, \(Y_t = GDPGR_t\), are given in the
fourth column of Table 15.1 for 2016:Q4-2017:Q4. Data on the regressor,
\(Y_{t-1} = GDPGR_{t-1}\), for those dates are given in the final column of Table 15.1. Thus
the OLS estimator is obtained by regressing the data in the fourth column of
Table 15.1 (extended back to the start of the sample) against the data in the final
column, including an intercept. To estimate this AR(1) model, we use data starting in
1962:Q1 and reserve the final observation, 2017:Q4, to illustrate computing the forecast
and forecast error. The resulting first-order autoregression, estimated using data
from 1962:Q1-2017:Q3, is


\[
\widehat{GDPGR}_t = \underset{(0.322)}{1.950} + \underset{(0.073)}{0.341}\, GDPGR_{t-1}. \tag{15.9}
\]

As usual, standard errors are given in parentheses under the estimated coefficients, and
\(\widehat{GDPGR}_t\) is the predicted value of \(GDPGR_t\) based on the estimated regression line.
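
The mechanics of estimating an AR(1) by OLS (build the first lag, then run a simple regression with an intercept) can be sketched as follows. This is an illustrative Python sketch with a hypothetical function name, `fit_ar1`; it uses the textbook closed-form formulas for a regression with a single regressor and does not compute standard errors. The demonstration series is artificial and noise-free, chosen only so that the true coefficients are known; it is not the GDP data.

```python
def fit_ar1(y):
    """Estimate Y_t = b0 + b1 * Y_{t-1} + u_t by OLS and return (b0_hat, b1_hat).

    y -- list of observations in time order
    """
    if len(y) < 3:
        raise ValueError("need at least three observations")

    regressor = y[:-1]   # Y_{t-1}: the first lag, for t = 2, ..., T
    dependent = y[1:]    # Y_t, aligned with its own first lag
    n = len(dependent)

    x_bar = sum(regressor) / n
    y_bar = sum(dependent) / n

    s_xx = sum((x - x_bar) ** 2 for x in regressor)
    if s_xx == 0:
        raise ValueError("the lagged series has no variation")
    s_xy = sum((x - x_bar) * (d - y_bar) for x, d in zip(regressor, dependent))

    b1_hat = s_xy / s_xx               # slope: coefficient on the first lag
    b0_hat = y_bar - b1_hat * x_bar    # intercept
    return b0_hat, b1_hat


# Artificial series generated exactly by Y_t = 2 + 0.5 * Y_{t-1}, starting from 0.
demo = [0.0]
for _ in range(9):
    demo.append(2.0 + 0.5 * demo[-1])

b0_hat, b1_hat = fit_ar1(demo)
print(b0_hat, b1_hat)
```

Note that the first observation is lost as a dependent variable, because its lag is not available; this is the reason the estimation sample in the text starts later than the first date in the data.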


Forecasts and forecast errors. If the population coefficients in Equation (15.8) were
known, then the one-step ahead forecast of \(Y_{T+1}\), made using data through date \(T\),
would be \(\beta_0 + \beta_1 Y_T\). Although \(\beta_0\) and \(\beta_1\) are unknown, the forecaster can use their
OLS estimates instead. Accordingly, the forecast based on the AR(1) model in Equation
(15.8) is

\[
\hat{Y}_{T+1|T} = \hat{\beta}_0 + \hat{\beta}_1 Y_T, \tag{15.10}
\]

where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are estimated using historical data through time \(T\). The forecast error
is \(Y_{T+1} - \hat{Y}_{T+1|T}\).


Application to GDP growth. What is the forecast of the growth rate of GDP in
the fourth quarter of 2017 (2017:Q4) that a forecaster would have made in
2017:Q3, based on the estimated AR(1) model in Equation (15.9) (which was
estimated using data through 2017:Q3)? According to Table 15.1, the growth rate
of GDP in 2017:Q3 was 3.11% (so \(GDPGR_{2017:Q3} = 3.11\)). Plugging this value
into Equation (15.9), the forecast of the growth rate of GDP in 2017:Q4 is
\(\widehat{GDPGR}_{2017:Q4|2017:Q3} = 1.950 + 0.341 \times GDPGR_{2017:Q3} = 1.950 + 0.341 \times 3.11 = 3.0\)
(rounded to the nearest tenth). Thus, the AR(1) model forecasts that the growth rate
of GDP will be 3.0% in 2017:Q4. Because data for 2017:Q4 are available, we can
evaluate the forecast error for this forecast. Table 15.1 shows that the actual growth
rate of GDP in 2017:Q4 was 2.5%, so the AR(1) forecast is high by 0.5 percentage
points; that is, the forecast error is -0.5.³

The \(R^2\) of the AR(1) model in Equation (15.9) is only 0.11, so the lagged value
of GDP growth explains only a small fraction of the variation in GDP growth in the 
sample used to fit the autoregression. It is therefore of interest to see whether includ- 
ing additional variables, beyond the first lag, could improve the fit of the forecasting 
model. 


The pth-Order Autoregressive Model


The AR(1) model uses \(Y_{t-1}\) to forecast \(Y_t\), but doing so ignores potentially useful
information in the more distant past. One way to incorporate this information is to
include additional lags in the AR(1) model; this yields the \(p\)th-order autoregressive
model.

The \(p\)th-order autoregressive [AR(\(p\))] model represents \(Y_t\) as a linear function of
\(p\) of its lagged values; that is, in the AR(\(p\)) model, the regressors are \(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}\),
plus an intercept. The number of lags, \(p\), included in an AR(\(p\)) model is called the
order, or lag length, of the autoregression.

For example, an AR(2) model of GDP growth uses two lags of GDP growth as
regressors. Estimated by OLS over the period 1962:Q1-2017:Q3, the AR(2) model is

\[
\widehat{GDPGR}_t = \underset{(0.37)}{1.60} + \underset{(0.08)}{0.28}\, GDPGR_{t-1} + \underset{(0.08)}{0.18}\, GDPGR_{t-2}. \tag{15.11}
\]

The coefficient on the additional lag in Equation (15.11) is significantly different
from 0 at the 5% significance level: The t-statistic is 2.30 (p-value = 0.02). This is
reflected in an improvement in the \(R^2\) from 0.11 for the AR(1) model in Equation (15.9)
to 0.14 for the AR(2) model.

The AR(p) model is summarized in Key Concept 15.4. 


KEY CONCEPT 15.4: Autoregressions

The \(p\)th-order autoregressive [AR(\(p\))] model represents the conditional expectation
of \(Y_t\) as a linear function of \(p\) of its lagged values:

\[
Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_p Y_{t-p} + u_t, \tag{15.12}
\]

where \(E(u_t | Y_{t-1}, Y_{t-2}, \ldots) = 0\). The number of lags \(p\) is called the order, or the
lag length, of the autoregression.


Properties of the forecast and error term in the AR(p) model. The assumption that
the conditional expectation of \(u_t\) is 0 given past values of \(Y_t\) [that is,
\(E(u_t | Y_{t-1}, Y_{t-2}, \ldots) = 0\)] has two important implications.

The first implication is that the best forecast of \(Y_{T+1}\) based on its entire
history depends on only the most recent \(p\) past values. Specifically, let
\(Y_{T+1|T} = E(Y_{T+1} | Y_T, Y_{T-1}, \ldots)\) denote the conditional mean of \(Y_{T+1}\) given its
entire history. Then \(Y_{T+1|T}\) is the oracle forecast and has the smallest MSFE of any
forecast, based on the history of \(Y\) (Exercise 15.5). That is, if \(Y_t\) follows an AR(\(p\)), then
the oracle forecast of \(Y_{T+1}\) based on \(Y_T, Y_{T-1}, \ldots\) is

\[
Y_{T+1|T} = \beta_0 + \beta_1 Y_T + \beta_2 Y_{T-1} + \cdots + \beta_p Y_{T-p+1}. \tag{15.13}
\]

In practice, the coefficients \(\beta_0, \beta_1, \ldots, \beta_p\) are unknown, so actual forecasts from an
AR(\(p\)) use Equation (15.13) with estimated coefficients.

The second implication is that the errors \(u_t\) are serially uncorrelated. This result
follows from Equation (2.28) (Exercise 15.5).


³The units of the arithmetic difference between two percentages are percentage points. For example, if an
interest rate is 3.5% at an annual rate and it rises to 3.8%, then it has risen by 0.3 percentage points.


Application to GDP growth. What is the forecast of the growth rate of GDP in
2017:Q4, using data through 2017:Q3, based on the AR(2) model of GDP growth
in Equation (15.11)? To compute this forecast, substitute the values of GDP growth
in 2017:Q2 and 2017:Q3 into Equation (15.11): \(\widehat{GDPGR}_{2017:Q4|2017:Q3} = 1.60 +
0.28\, GDPGR_{2017:Q3} + 0.18\, GDPGR_{2017:Q2} = 1.60 + 0.28 \times 3.11 + 0.18 \times 3.01 = 3.0\),
where the 2017:Q3 and 2017:Q2 values for GDPGR are taken from the fourth
column of Table 15.1. The forecast error is the actual value, 2.5%, minus the forecast,
or 2.5% - 3.0% = -0.5 percentage points, essentially the same as the AR(1) forecast
error.
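
The forecast arithmetic for both autoregressions follows one pattern: an intercept plus each estimated coefficient times the corresponding most recent value. The Python sketch below uses a hypothetical helper, `ar_forecast`, together with the rounded coefficients reported in Equations (15.9) and (15.11) and the growth rates in Table 15.1; it is meant only to make the substitution step explicit.

```python
def ar_forecast(coefs, history):
    """One-step ahead AR(p) forecast: b0 + b1*Y_T + b2*Y_{T-1} + ... + bp*Y_{T-p+1}.

    coefs   -- [b0, b1, ..., bp], intercept first
    history -- observed values in time order, ending with Y_T
    """
    p = len(coefs) - 1
    if len(history) < p:
        raise ValueError("history is shorter than the lag length")
    forecast = coefs[0]
    for i in range(1, p + 1):
        # history[-1] is Y_T, history[-2] is Y_{T-1}, and so on.
        forecast += coefs[i] * history[-i]
    return forecast


# GDP growth rates from Table 15.1, 2016:Q4 through 2017:Q3 (percent, annual rate).
gdpgr = [1.74, 1.23, 3.01, 3.11]
actual_2017q4 = 2.50

ar1_forecast = ar_forecast([1.950, 0.341], gdpgr)      # Equation (15.9)
ar2_forecast = ar_forecast([1.60, 0.28, 0.18], gdpgr)  # Equation (15.11)

ar1_error = actual_2017q4 - ar1_forecast
ar2_error = actual_2017q4 - ar2_forecast

print(round(ar1_forecast, 1), round(ar1_error, 1))
print(round(ar2_forecast, 1), round(ar2_error, 1))
```

Rounded to one decimal place, both forecasts are 3.0 and both forecast errors are -0.5 percentage points, in line with the calculations in the text.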


Time Series Regression with Additional 
Predictors and the Autoregressive 
Distributed Lag Model 


Economic theory often suggests other variables that could help forecast a variable
of interest. These other variables, or predictors, can be added to an autoregression to
produce a time series regression model with multiple predictors. When other
variables and their lags are added to an autoregression, the result is an autoregressive
distributed lag model.

FIGURE 15.3 Interest Rates and the Term Spread, 1960-2017
Long-term and short-term interest rates on bonds move together but not one-for-one. The difference
between long-term rates and short-term rates is called the term spread. The term spread has fallen
sharply before U.S. recessions, which are shown as shaded regions in the figures.
(a) 10-year interest rate and 3-month interest rate


Forecasting GDP Growth Using the Term Spread 


Interest rates on long-term and short-term bonds move together but not one for one. 
Figure 15.3a plots interest rates on 10-year U.S. Treasury bonds and 3-month Treasury 
bills from 1960 through 2017. These interest rates show the same long-run tendencies:
Both were low in the 1960s, both rose through the 1970s and peaked in the early 1980s, 
and both fell subsequently. But the gap, or difference, between the two interest rates has 
not been constant: While short-term rates are generally below long-term rates, the gap 
between them narrows and even disappears shortly before the start of a recession; reces- 
sions are shown as the shaded bars in the figure. This difference between long-term and 
short-term interest rates is called the term spread and is plotted in Figure 15.3b. The term 
spread is generally positive, but it falls toward or below 0 before recessions.

Figure 15.3 suggests that the term spread might contain information about
future GDP growth that is not already contained in past values of GDP growth. This
conjecture is readily checked by augmenting the AR(2) model in Equation (15.11)
to include the first lag of the term spread:


$\widehat{GDPGR}_t = \underset{(0.47)}{0.94} + \underset{(0.08)}{0.27}\,GDPGR_{t-1} + \underset{(0.08)}{0.19}\,GDPGR_{t-2} + \underset{(0.18)}{0.42}\,TSpread_{t-1}$. (15.14)


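The t-statistic on $TSpread_{t-1}$ is 2.34, so this coefficient is significant at the 5% level
(2.34 exceeds the 5% critical value of 1.96 but not the 1% critical value of 2.58).
The $\bar{R}^2$ of this regression is 0.16, an improvement over the AR(2) $\bar{R}^2$ of 0.14.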

The forecast of the rate of GDP growth in 2017:Q4 is obtained by substituting
the 2017:Q2 and 2017:Q3 values of GDP growth into Equation (15.14), along with
the value of the term spread in 2017:Q3 (which is 1.21); the resulting forecast is
$\widehat{GDPGR}_{2017:Q4|2017:Q3} = 2.9\%$, and the forecast error is $-0.4$ percentage points.

If one lag of the term spread is helpful for forecasting GDP growth, more lags 
might be even more helpful; adding an additional lag of the term spread yields 


$\widehat{GDPGR}_t = \underset{(0.46)}{0.94} + \underset{(0.08)}{0.25}\,GDPGR_{t-1} + \underset{(0.08)}{0.18}\,GDPGR_{t-2} - \underset{(0.42)}{0.13}\,TSpread_{t-1} + \underset{(0.43)}{0.62}\,TSpread_{t-2}$. (15.15)


The t-statistic testing the significance of the second lag of the term spread is 1.46
(p-value = 0.14), so it falls just short of statistical significance at the 10% level. The
$\bar{R}^2$ of the regression in Equation (15.15) is 0.16, essentially the same as that in
Equation (15.14).
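
As a rough check on these significance statements, the t-statistics can be recomputed from the rounded coefficients and standard errors printed in Equations (15.14) and (15.15). The sketch below is illustrative only: the helper `t_and_p` is a hypothetical name, and the p-values use a standard normal approximation. Because the printed coefficients are rounded, the ratios come out near 2.33 and 1.44 rather than exactly the reported 2.34 and 1.46.

```python
from statistics import NormalDist

def t_and_p(coef, se):
    """t-statistic for H0: coefficient = 0 and its two-sided normal p-value."""
    t = coef / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

t1, p1 = t_and_p(0.42, 0.18)  # first lag of the term spread, Equation (15.14)
t2, p2 = t_and_p(0.62, 0.43)  # second lag of the term spread, Equation (15.15)

print(round(t1, 2), round(p1, 3))
print(round(t2, 2), round(p2, 3))
```

The first p-value lies between 0.01 and 0.05, and the second is above 0.10.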

The forecasted rate of GDP growth in 2017:Q4 is computed by substituting the
values of the variables into Equation (15.15). The term spread was 1.37 in 2017:Q2
and 1.21 in 2017:Q3. The forecasted value of the rate of GDP growth in 2017:Q4,
based on Equation (15.15), is

$\widehat{GDPGR}_{2017:Q4|2017:Q3} = 0.94 + 0.25 \times 3.11 + 0.18 \times 3.01 - 0.13 \times 1.21 + 0.62 \times 1.37 = 2.9$. (15.16)

The forecast error is $-0.4$ percentage points.
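
The two term-spread forecasts can be reproduced in the same way. The sketch below is illustrative: `adl_forecast` is a hypothetical helper, and the inputs are the rounded coefficients and data values quoted in the text. With rounded coefficients the ADL(2, 1) forecast comes out near 2.86 and the ADL(2, 2) forecast near 2.95; the text reports both as 2.9, and small differences of this kind are expected when rounded coefficients are used.

```python
def adl_forecast(intercept, y_coefs, y_lags, x_coefs, x_lags):
    """One-step-ahead ADL forecast: intercept + lags of Y + lags of X.

    Lists are ordered from the most recent lag to the most distant one.
    """
    return (intercept
            + sum(b * y for b, y in zip(y_coefs, y_lags))
            + sum(d * x for d, x in zip(x_coefs, x_lags)))

gdp_lags = [3.11, 3.01]     # GDP growth in 2017:Q3 and 2017:Q2
spread_lags = [1.21, 1.37]  # term spread in 2017:Q3 and 2017:Q2

f_adl21 = adl_forecast(0.94, [0.27, 0.19], gdp_lags, [0.42], spread_lags[:1])
f_adl22 = adl_forecast(0.94, [0.25, 0.18], gdp_lags, [-0.13, 0.62], spread_lags)

print(round(f_adl21, 2), round(f_adl22, 2))
```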


The Autoregressive Distributed Lag Model 


Each model in Equations (15.14) and (15.15) is an autoregressive distributed lag
(ADL) model: autoregressive because lagged values of the dependent variable are
included as regressors, as in an autoregression, and distributed lag because the regression
also includes multiple lags (a “distributed lag”) of an additional predictor. In
general, an ADL model with p lags of the dependent variable $Y_t$ and q lags of an
additional predictor $X_t$ is called an ADL(p, q) model. In this notation, the model in
Equation (15.14) is an ADL(2, 1) model, and the model in Equation (15.15) is an
ADL(2, 2) model.

KEY CONCEPT 15.5: The Autoregressive Distributed Lag Model

The autoregressive distributed lag model with p lags of $Y_t$ and q lags of $X_t$, denoted
ADL(p, q), is

$Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_p Y_{t-p} + \delta_1 X_{t-1} + \delta_2 X_{t-2} + \cdots + \delta_q X_{t-q} + u_t$, (15.17)

where $\beta_0, \beta_1, \ldots, \beta_p, \delta_1, \ldots, \delta_q$ are unknown coefficients and $u_t$ is the error term
with $E(u_t \mid Y_{t-1}, Y_{t-2}, \ldots, X_{t-1}, X_{t-2}, \ldots) = 0$.

The ADL model is summarized in Key Concept 15.5. The notation in 
Equation (15.17) is somewhat cumbersome, and alternative optional notation, based 
on the so-called lag operator, is presented in Appendix 15.3. 

The assumption that the errors in the ADL model have a conditional mean of 0 given
all past values of Y and X, that is, that $E(u_t \mid Y_{t-1}, Y_{t-2}, \ldots, X_{t-1}, X_{t-2}, \ldots) = 0$,
implies that no additional lags of either Y or X belong in the ADL model. In other
words, the lag lengths p and q are the true lag lengths, and the coefficients on additional
lags are 0.
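
To make the structure of Equation (15.17) concrete, the following sketch builds the ADL(p, q) regressors and estimates the coefficients by OLS. It is a minimal illustration under stated assumptions: the data are simulated from an ADL(2, 1) with made-up coefficients (not the GDP data), the names `solve` and `fit_adl` are hypothetical, and the normal equations are solved in plain Python only to keep the example self-contained. In applied work a numerical library would normally be used instead.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_adl(y, x, p, q):
    """OLS estimates of (beta_0, beta_1..beta_p, delta_1..delta_q)."""
    start = max(p, q)  # first period for which all lags are available
    rows = [[1.0]
            + [y[t - i] for i in range(1, p + 1)]
            + [x[t - j] for j in range(1, q + 1)]
            for t in range(start, len(y))]
    target = y[start:]
    k = 1 + p + q
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)

# Simulated ADL(2, 1): Y_t = 1.0 + 0.3 Y_{t-1} + 0.2 Y_{t-2} + 0.5 X_{t-1} + u_t.
random.seed(0)
n = 3000
x, y = [0.0, 0.0], [0.0, 0.0]
for t in range(2, n):
    x.append(0.5 * x[t - 1] + random.gauss(0, 1))
    y.append(1.0 + 0.3 * y[t - 1] + 0.2 * y[t - 2] + 0.5 * x[t - 1] + random.gauss(0, 1))

estimates = fit_adl(y, x, p=2, q=1)
print([round(b, 2) for b in estimates])
```

With a sample this long, the estimates should land close to the coefficients used in the simulation, in the order intercept, lags of Y, lags of X.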


The Least Squares Assumptions for Forecasting 
with Multiple Predictors 


The general time series regression model with multiple predictors extends the ADL 
model to include multiple predictors and their lags. The model is summarized in 
Key Concept 15.6. The presence of multiple predictors and their lags leads to double 
subscripting of the regression coefficients and regressors. 

The assumptions in Key Concept 15.6 are the time series counterparts of the four 
least squares assumptions for prediction with multiple regression using cross- 
sectional data (Appendix 6.4). 

The first assumption is that u, has conditional mean 0 given the history of all the 
regressors. This assumption extends the assumption used in the AR and ADL models 
and implies that the oracle forecast of $Y_t$ using all past values of Y and the X’s is given
by the regression in Equation (15.18). 

The second least squares assumption for cross-sectional data is that
$(X_{1i}, \ldots, X_{ki}, Y_i)$, $i = 1, \ldots, n$, are independently and identically distributed
(i.i.d.). The second assumption for time series regression replaces the i.i.d. assumption
by a more appropriate one with two parts. Part (a) is that the data are drawn
from a stationary distribution, so that the distribution of the time series today is the
same as its distribution in the past. This assumption is a time series version of the
identically distributed part of the i.i.d. assumption: The cross-sectional requirement
of each draw being identically distributed is replaced by the time series requirement
that the joint distribution of the variables, including lags, not change over time. If the
time series variables are nonstationary, then one or more problems can arise in time
series regression, including biased forecasts.

KEY CONCEPT 15.6: The Least Squares Assumptions for Forecasting with Time Series Data

The general time series regression model allows for k additional predictors,
where $q_1$ lags of the first predictor are included, $q_2$ lags of the second predictor
are included, and so forth:

$Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_p Y_{t-p}$
$\quad + \delta_{11} X_{1t-1} + \delta_{12} X_{1t-2} + \cdots + \delta_{1q_1} X_{1t-q_1}$
$\quad + \cdots + \delta_{k1} X_{kt-1} + \delta_{k2} X_{kt-2} + \cdots + \delta_{kq_k} X_{kt-q_k} + u_t$, (15.18)

where
1. $E(u_t \mid Y_{t-1}, Y_{t-2}, \ldots, X_{1t-1}, X_{1t-2}, \ldots, X_{kt-1}, X_{kt-2}, \ldots) = 0$;
2. (a) The random variables $(Y_t, X_{1t}, \ldots, X_{kt})$ have a stationary distribution, and
   (b) $(Y_t, X_{1t}, \ldots, X_{kt})$ and $(Y_{t-j}, X_{1t-j}, \ldots, X_{kt-j})$ become independent as
   j gets large;
3. Large outliers are unlikely: $X_{1t}, \ldots, X_{kt}$ and $Y_t$ have nonzero, finite fourth
   moments; and
4. There is no perfect multicollinearity.

The assumption of stationarity implies that the conditional mean for the data 
used to estimate the model is also the conditional mean for the out-of-sample obser- 
vation of interest. Thus the assumption of stationarity is also an assumption about 
external validity, and it plays the role of the first least squares assumption for predic- 
tion in Appendix 6.4. 

Part (b) of the second assumption requires that the random variables become 
independently distributed when the amount of time separating them becomes large. 
This replaces the cross-sectional requirement that the variables be independently 
distributed from one observation to the next with the time series requirement that 
they be independently distributed when they are separated by long periods of time. 
This assumption is sometimes referred to as weak dependence, and it ensures that in 
large samples there is sufficient randomness in the data for the law of large numbers 
and the central limit theorem to hold. For a precise mathematical statement of the 
weak dependence condition, see Hayashi (2000, Chapter 2).

The third assumption (no large outliers) and fourth assumption (no perfect
multicollinearity) are the same as for cross-sectional data.

Under the assumptions of Key Concept 15.6, inference on the regression coeffi- 
cients using OLS proceeds in the same way as it usually does using cross-sectional data. 


Estimation of the MSFE and 
Forecast Intervals 


An estimate of the MSFE can be used to summarize forecast uncertainty and to 
construct forecast intervals. 


Estimation of the MSFE 


The MSFE, defined in Equation (15.7), is an expected value that depends on the 
distribution of Y and on the forecasting model. Because it is an expectation, its value 
is not known and must be estimated from the data. 

A natural instinct would be to estimate the MSFE by replacing the expectation in 
Equation (15.7) with an average over out-of-sample observations. The out-of-sample 
data, however, are not observed, so this approach is not feasible. Instead, there are 
three commonly used methods, with increasing complexity, for estimation of the MSFE. 
All three methods necessarily rely on the in-sample data. The simplest estimator 
focuses only on future uncertainty and ignores uncertainty associated with estimation 
of the regression coefficients. The second estimator incorporates future uncertainty and 
estimation error, under the assumption of stationarity so that the conditional expecta- 
tion estimated by the model applies to the out-of-sample forecast. The third incorpo- 
rates uncertainty and estimation error and in addition allows for the possibility that the 
conditional expectation might change over the course of the sample. 

The first two methods are based on an expression for the MSFE derived from 
Equation (15.7) and the assumption of stationarity. We provide this expression here 
for an AR(p); it extends directly to the models with additional predictors in Key 
Concept 15.6. Under the assumption of stationarity, 


$MSFE = \sigma_u^2 + \mathrm{var}(\hat{\beta}_0 + \hat{\beta}_1 Y_T + \cdots + \hat{\beta}_p Y_{T-p+1})$. (15.19)


This result is shown for an AR(1) in Exercise 15.12. 

The first term in Equation (15.19) is the variance of $Y_{T+1}$ around its conditional
mean. This is the variance of the oracle forecast error. The second term in Equation (15.19)
arises because the coefficients of the autoregression are unknown and must be
estimated.


Method 1: Estimating the MSFE by the standard error of the regression. Because
the variance of the OLS estimator is proportional to 1/T, the second term in
Equation (15.19) is proportional to 1/T. Consequently, if the number of observations T
is large relative to the number of autoregressive lags p, then the contribution of the
second term is small relative to the first term. That is, if T is large relative to p,
Equation (15.19) simplifies to the approximation $MSFE \approx \sigma_u^2$. This simplification in
turn suggests estimating the MSFE by

$\widehat{MSFE}_{SER} = s_{\hat{u}}^2$, where $s_{\hat{u}}^2 = \frac{SSR}{T - p - 1}$, (15.20)

where SSR is the sum of squared residuals of the autoregression. The statistic $s_{\hat{u}}^2$ is the
square of the standard error of the regression (SER), originally defined in Equation (6.13)
and restated in Equation (15.20) using the notation of autoregressions.


Method 2: Estimating the MSFE by the final prediction error. If T is not large relative
to p, the sampling error of the estimated autoregression coefficients can be sufficiently
large that the second term in Equation (15.19) should not be ignored. The final
prediction error (FPE) is an estimate of the MSFE that incorporates both terms in
Equation (15.19), under the additional assumption that the errors are homoskedastic.
With homoskedastic errors, $\mathrm{var}(\hat{\beta}_0 + \hat{\beta}_1 Y_T + \cdots + \hat{\beta}_p Y_{T-p+1}) \approx \sigma_u^2[(p + 1)/T]$
(shown in Appendix 19.7); substitution of this expression into Equation (15.19)
yields $MSFE \approx \sigma_u^2 + \sigma_u^2 \frac{p+1}{T} = \sigma_u^2\left(1 + \frac{p+1}{T}\right)$. The FPE uses this final expression,
along with the estimator $s_{\hat{u}}^2$, to estimate the MSFE:

$\widehat{MSFE}_{FPE} = \left(\frac{T + p + 1}{T}\right) s_{\hat{u}}^2 = \left(\frac{T + p + 1}{T - p - 1}\right) \frac{SSR}{T}$. (15.21)


The FPE estimator improves upon the squared SER in Equation (15.20) by adjusting 
for the sampling uncertainty in estimating the autoregression coefficients. 


Method 3: Estimating the MSFE by pseudo out-of-sample forecasting. The third 
estimate of the MSFE uses the data to simulate out-of-sample forecasting. This 
method proceeds by first dividing the data set into two parts: an initial estimation 
sample (the first T-P observations) and a reserved sample (the final P observations). 
The initial estimation sample is used to estimate the forecasting model, which is then 
used to forecast the first observation in the reserved sample. Next the estimation 
sample is augmented by the first observation in the reserved sample, and the model 
is reestimated and is used to forecast the second observation in the reserved sample. 
This procedure is repeated until the forecast is made for the final observation in the 
reserved sample and produces P forecasts and thus P forecast errors. Those P forecast 
errors can then be used to estimate the MSFE.⁴

⁴Readers of Chapter 14 will recognize that this method for estimating the MSFE is related to estimation
of the mean squared prediction error by cross validation.

KEY CONCEPT 15.7: Pseudo Out-of-Sample Forecasts

Pseudo out-of-sample forecasts are computed using the following steps:

1. Choose a number of observations, P, for which you will generate pseudo out-of-sample
   forecasts; for example, P might be 10% or 20% of the sample size. Let $s = T - P$.
2. Estimate the forecasting regression using the estimation sample, that is,
   using observations $t = 1, \ldots, s$.
3. Compute the forecast for the first period beyond this shortened sample, $s + 1$;
   call this $\tilde{Y}_{s+1|s}$.
4. Compute the forecast error, $\tilde{u}_{s+1} = Y_{s+1} - \tilde{Y}_{s+1|s}$.
5. Repeat steps 2 through 4 for the remaining periods, $s = T - P + 1$ to
   $T - 1$ (reestimate the regression for each period). The pseudo out-of-sample
   forecasts are $\tilde{Y}_{s+1|s}$, $s = T - P, \ldots, T - 1$, and the pseudo out-of-sample
   forecast errors are $\tilde{u}_{s+1}$, $s = T - P, \ldots, T - 1$.


This method of estimating a model on a subsample of the data and then using 
that model to forecast on a reserved sample is called pseudo out-of-sample 
forecasting: out-of-sample because the observations being forecasted were not used 
for model estimation but pseudo because the reserved data are not truly out-of- 
sample observations. The construction of pseudo out-of-sample forecasts is summa- 
rized in Key Concept 15.7. 

With the resulting pseudo out-of-sample forecast errors $\tilde{u}_s$, $s = T - P + 1, \ldots, T$,
in hand, the pseudo out-of-sample estimate of the MSFE is

$\widehat{MSFE}_{POOS} = \frac{1}{P} \sum_{s=T-P+1}^{T} \tilde{u}_s^2$. (15.22)


Compared to the squared SER estimate in Equation (15.20) and the final predic- 
tion error estimate in Equation (15.21), the pseudo out-of-sample estimate in 
Equation (15.22) has both advantages and disadvantages. The main advantage is that 
it does not rely on the assumption of stationarity, so that the conditional mean might 
differ between the estimation and the reserved samples. For example, the coefficients 
of the autoregression need not be the same in the two samples, and indeed the pseudo 
out-of-sample forecast error need not have mean 0. Thus any bias in the forecast 
arising because of a change in coefficients will be captured by $\widehat{MSFE}_{POOS}$ but not by
the other two estimators [which rely on Equation (15.19), which was derived under
the assumption of stationarity]. Three disadvantages of the pseudo out-of-sample
estimate are that it is more difficult to compute, that the estimate of the MSFE will
have greater sampling variability than the other two estimates if Y is, in fact,
stationary (because $\widehat{MSFE}_{POOS}$ uses only P forecast errors), and that it requires
choosing P.

The choice of P entails a trade-off between the precision of the coefficient esti- 
mates and the number of observations available for estimating the MSFE. In prac- 
tice, choosing P to be 10% or 20% of the total number of observations can provide 
a reasonable balance between these two considerations. 
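
The recursive procedure described above can be sketched in code. The example below is a minimal illustration, not the computation behind the GDP results: it uses an AR(1) forecasting model, simulated data with made-up coefficients, and hypothetical function names (`fit_ar1`, `poos_errors`). The loop index `s` plays the role of the estimation-sample size, running from T - P to T - 1, and the regression is reestimated at every step.

```python
import random

def fit_ar1(y):
    """OLS of y[t] on an intercept and y[t-1]; returns (b0, b1)."""
    lagged, current = y[:-1], y[1:]
    n = len(lagged)
    mean_x = sum(lagged) / n
    mean_y = sum(current) / n
    sxx = sum((a - mean_x) ** 2 for a in lagged)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lagged, current))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def poos_errors(y, P):
    """Pseudo out-of-sample forecast errors for the last P observations."""
    T = len(y)
    errors = []
    for s in range(T - P, T):           # s = T-P, ..., T-1
        b0, b1 = fit_ar1(y[:s])         # reestimate on observations 1..s
        forecast = b0 + b1 * y[s - 1]   # forecast of observation s+1 given Y_s
        errors.append(y[s] - forecast)  # y[s] holds observation s+1 (0-based list)
    return errors

# Simulated AR(1) data with error variance 1 (illustrative values only).
random.seed(0)
y = [2.0]
for _ in range(199):
    y.append(1.0 + 0.5 * y[-1] + random.gauss(0, 1))

P = 40  # 20% of the T = 200 observations
errors = poos_errors(y, P)
msfe_poos = sum(e ** 2 for e in errors) / P
print(len(errors), round(msfe_poos, 2))
```

Because only P = 40 forecast errors enter the average, the resulting estimate can differ noticeably from the error variance of 1 used in the simulation, which illustrates the sampling variability mentioned above.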


Application to GDP growth. For the AR(1) in Equation (15.9), $\widehat{RMSFE}_{SER} = 3.05$,
$\widehat{RMSFE}_{FPE} = 3.07$, and $\widehat{RMSFE}_{POOS} = 2.60$ (computed over the final 44 quarters,
or 20% of the sample). For the AR(2) in Equation (15.11), $\widehat{RMSFE}_{SER} = 3.01$,
$\widehat{RMSFE}_{FPE} = 3.03$, and $\widehat{RMSFE}_{POOS} = 2.52$. The FPE estimates are larger than the
SER estimates because of the additional factor that estimates the variance from 
estimating the coefficients. The pseudo out-of-sample estimates of the RMSFE are 
smaller than the in-sample estimates. In part, this reflects the reduction in the 
variability of GDP growth that occurred in the early 1980s that is evident in 
Figure 15.1b, a phenomenon known as the Great Moderation. 
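
The two in-sample estimates can be reproduced from the values of SSR(p)/T reported for the AR(1) and AR(2) in Table 15.3 (9.247 and 8.954, with T = 223). The sketch below applies Equations (15.20) and (15.21) directly; the function names are hypothetical, and the pseudo out-of-sample figures cannot be reproduced this way because they require the underlying data.

```python
from math import sqrt

def msfe_ser(ssr, T, p):
    """Squared standard error of the regression, Equation (15.20)."""
    return ssr / (T - p - 1)

def msfe_fpe(ssr, T, p):
    """Final prediction error, Equation (15.21)."""
    return (T + p + 1) / T * msfe_ser(ssr, T, p)

T = 223
results = {}
for p, ssr_over_T in [(1, 9.247), (2, 8.954)]:
    ssr = ssr_over_T * T
    results[p] = (round(sqrt(msfe_ser(ssr, T, p)), 2),
                  round(sqrt(msfe_fpe(ssr, T, p)), 2))
    print(p, results[p])
# prints:
# 1 (3.05, 3.07)
# 2 (3.01, 3.03)
```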


Forecast Uncertainty and Forecast Intervals 


In any estimation problem, it is good practice to report a measure of the uncertainty 
of that estimate, and forecasting is no exception. One measure of the uncertainty of 
a forecast is its root mean squared forecast error (RMSFE). Under the additional 
assumption that the errors u, are normally distributed, the estimates of the RMSFE 
introduced in Section 15.3 can be used to construct a forecast interval—that is, an 
interval that contains the future value of the variable with a certain probability. 


Forecast intervals. A forecast interval is like a confidence interval except that it 
pertains to a forecast. For example, a 95% forecast interval is an interval that con- 
tains the future value of the variable being forecasted in 95% of repeated 
applications. 

One important difference between a forecast interval and a confidence interval
is that the usual formula for a 95% confidence interval (the estimator ± 1.96 standard
errors) is justified by the central limit theorem and therefore holds for a wide
range of distributions of the error term. In contrast, because the forecast error in
Equation (15.15) includes the future value of the error $u_{T+1}$, computing a forecast
interval requires either estimating the distribution of the error term or making some
assumption about that distribution.

In practice, it is convenient to assume that $u_{T+1}$ is normally distributed. Under the
assumption of stationarity, the forecast error is the sum of $u_{T+1}$ and a term reflecting
the estimation error of the regression coefficients. In large samples, this second term
is approximately normally distributed (by the central limit theorem) and is uncorrelated
with $u_{T+1}$. Thus, if $u_{T+1}$ is normally distributed, the forecast error is approximately
normally distributed and has a variance equal to the MSFE (Exercise 15.12).


The River of Blood 


As part of its efforts to inform the public about
monetary policy decisions, the Bank of England
regularly publishes forecasts of inflation. These
forecasts combine output from econometric models
maintained by professional econometricians at the 
bank with the expert judgment of the members of the 
bank’s senior staff and Monetary Policy Committee. 
The forecasts are presented as a set of forecast inter- 
vals designed to reflect what these economists con- 
sider to be the range of probable paths that inflation 
might take. In its Inflation Report, the bank prints 
these ranges in red, with the darkest red reserved 
for the central band. Although the bank prosaically 
refers to this as the “fan chart,” the press has called 
these spreading shades of red the “river of blood.” 
The river of blood for February 2017 is shown 
in Figure 15.4. (In this figure, the blood is blue, not 
red, so you will need to use your imagination.) This 
chart shows that, as of February 2017, the bank’s 
economists expected the rate of inflation to rise 
from below its 2.0% target in early 2017 to 2.7% in 


the first quarter of 2018. The economists cited an 
expected strengthening of demand and a deprecia- 
tion in the British pound as reasons for the increase 
in the inflation rate. As it turned out, inflation rose 
over the next year by more than they had forecasted, 
to 3.0% in early 2018. 

The Bank of England has been a pioneer in the 
movement toward greater openness by central banks, 
and other central banks now also publish inflation 
forecasts. The decisions made by monetary policy 
makers are difficult ones and affect the lives—and 
wallets—of many of their fellow citizens. In a democ- 
racy in the information age, reasoned the economists 
at the Bank of England, it is particularly important 
for citizens to understand the bank’s economic out- 
look and the reasoning behind its difficult decisions. 

To see the river of blood in its original red hue,
visit the Bank of England’s website at http://www.bankofengland.co.uk.
To learn more about the performance of the Bank of England inflation forecasts,
see Clements (2004).


[Figure 15.4: The River of Blood. The Bank of England's fan chart for February 2017 shows
forecast ranges for inflation. Vertical axis: percentage increase in prices on a year earlier.
Source: Reprinted with permission from the Bank of England.]

The last two estimators of the MSFE, $\widehat{MSFE}_{FPE}$ and $\widehat{MSFE}_{POOS}$, incorporate
estimation error, and either one can be used to construct forecast intervals.
That is, if $u_{T+1}$ is normally distributed, a 95% forecast interval is given by
$\hat{Y}_{T+1|T} \pm 1.96\,\widehat{RMSFE}$, where $\widehat{RMSFE}$ is the square root of either
$\widehat{MSFE}_{FPE}$ in Equation (15.21) or $\widehat{MSFE}_{POOS}$ in Equation (15.22).
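
As an illustration of the interval formula, the sketch below combines two numbers reported earlier in the chapter: the AR(2) forecast of 2017:Q4 GDP growth (3.0) and the FPE-based RMSFE for the AR(2) (3.03). Pairing them this way is an assumption made for the example; the text itself does not report this interval, and the function name is hypothetical.

```python
def forecast_interval(forecast, rmsfe, z=1.96):
    """Forecast interval: forecast +/- z * RMSFE (z = 1.96 gives 95% coverage
    when the forecast error is approximately normal)."""
    half_width = z * rmsfe
    return forecast - half_width, forecast + half_width

lower, upper = forecast_interval(3.0, 3.03)
print(round(lower, 2), round(upper, 2))  # -2.94 8.94
```

The width of this interval, roughly 12 percentage points, shows how much uncertainty surrounds a one-quarter-ahead forecast of GDP growth.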

This discussion has focused on the case that $u_t$ is homoskedastic. If instead it is
heteroskedastic, then one needs to develop a model of the heteroskedasticity so that
the term $\sigma_u^2$ in Equation (15.19) can be estimated given the most recent values of Y
and X. Methods for modeling this conditional heteroskedasticity are presented in
Chapter 17.


Fan charts. To convey the full range of uncertainty about future values of a variable, 
professional forecasters sometimes report multiple forecast intervals. Taken together, 
multiple forecast intervals summarize the full distribution of future values of the 
variable. A forecast of the distribution of future values of a variable provides a great 
deal more information to consumers of forecasts than does a forecast of just its mean. 

Forecast distributions are frequently conveyed graphically in what is known as a 
fan chart. Fan charts portray the distribution at a future date by shaded overlaid fore- 
cast intervals, connected over an expanding forecast horizon. The Bank of England 
was one of the early users of fan charts as a way to convey forecast paths and uncer- 
tainty to the public and to monetary policy makers (see the box “The River of Blood”). 
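
The idea of reporting several forecast intervals at once can be sketched for a single forecast date. The example below is a simplified illustration under the normality assumption: it reuses the AR(2) forecast (3.0) and FPE-based RMSFE (3.03) from earlier in the chapter, the coverage levels are arbitrary choices, and the function name is hypothetical. A full fan chart would repeat this calculation at each forecast horizon with a horizon-specific RMSFE, which is not attempted here.

```python
from statistics import NormalDist

def nested_intervals(forecast, rmsfe, coverages):
    """Normal-theory forecast intervals at several coverage levels."""
    bands = {}
    for c in coverages:
        z = NormalDist().inv_cdf(0.5 + c / 2)  # two-sided critical value
        bands[c] = (forecast - z * rmsfe, forecast + z * rmsfe)
    return bands

bands = nested_intervals(3.0, 3.03, [0.50, 0.68, 0.90, 0.95])
for coverage, (lo, hi) in bands.items():
    print(f"{coverage:.0%}: ({lo:.2f}, {hi:.2f})")
```

Each interval contains the ones with lower coverage, which is what produces the layered shading in a fan chart.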


Estimating the Lag Length Using 
Information Criteria 


The estimated GDP growth regressions in Sections 15.3 and 15.4 have either one or 
two lags of the predictors. Why not more lags? How many lags should you include in 
a time series regression? This section discusses statistical methods for choosing the 
number of lags, first in an autoregression and then in a time series regression model 
with multiple predictors. 


Determining the Order of an Autoregression 


In practice, choosing the order p of an autoregression requires balancing the marginal 
benefit of including more lags against the marginal cost of additional estimation 
error. On the one hand, if the order of an estimated autoregression is too low, you 
will omit potentially valuable information contained in the more distant lagged val- 
ues. On the other hand, if it is too high, you will be estimating more coefficients than 
necessary, which in turn introduces additional estimation error into your forecasts. 


The F-statistic approach. One approach to choosing p is to start with a model with
many lags and to perform hypothesis tests on the final lag. For example, you might
start by estimating an AR(6) and test whether the coefficient on the sixth lag is significant
at the 5% level; if not, drop it and estimate an AR(5), test the coefficient on the
fifth lag, and so forth. The drawback to this method is that it will tend to produce large
models: Even if the true AR order is five, so the sixth coefficient is 0, a 5% test using
the t-statistic will incorrectly reject this null hypothesis 5% of the time just by chance.
Thus, if the true value of p is five, this method will estimate p to be six 5% of the time.


The BIC. One way around this problem is to estimate p by minimizing an information
criterion. One such information criterion is the Bayes information criterion
(BIC), also called the Schwarz information criterion (SIC), which is

$BIC(p) = \ln\left[\frac{SSR(p)}{T}\right] + (p + 1)\frac{\ln(T)}{T}$, (15.23)

where SSR(p) is the sum of squared residuals of the estimated AR(p). The BIC
estimator of p, $\hat{p}$, is the value that minimizes BIC(p) among the possible choices
$p = 0, 1, \ldots, p_{max}$, where $p_{max}$ is the largest value of p considered and p = 0
corresponds to the model that contains only an intercept.

The formula for the BIC might look a bit mysterious at first, but it has an intui- 
tive interpretation. Consider the first term in Equation (15.23). Because the regres- 
sion coefficients are estimated by OLS, the sum of squared residuals necessarily 
decreases (or at least does not increase) when you add a lag. In contrast, the second 
term is the number of estimated regression coefficients (the number of lags, p, plus 
one for the intercept) times the factor ln(T)/T. This second term increases when you
add a lag and thus provides a penalty for including another lag. The BIC trades off 
these two forces so that the number of lags that minimizes the BIC is a consistent 
estimator of the true lag length. Appendix 15.5 provides the mathematics of this 
argument. 

As an example, consider estimating the AR order for an autoregression of
the growth rate of GDP. The various steps in the calculation of the BIC are
carried out in Table 15.3 for autoregressions of maximum order six ($p_{max} = 6$). For
example, for the AR(1) model in Equation (15.8), SSR(1)/T = 9.247, so
ln[SSR(1)/T] = 2.224. Because T = 223 (1962:Q1-2017:Q3), ln(T)/T = 0.024,
and (p + 1)ln(T)/T = 2 × 0.024 = 0.048. Thus BIC(1) = 2.224 + 0.048 = 2.273
(the last digit reflects rounding of the two terms).

The BIC is smallest when p = 2 in Table 15.3. Thus the BIC estimate of the lag
length is 2. As can be seen in Table 15.3, as the number of lags increases, the $R^2$
increases, and the SSR decreases. The increase in the $R^2$ is large from zero to one lag,
smaller for one to two lags, and smaller yet for other lags. The BIC helps decide precisely
how large the increase in the $R^2$ must be to justify including the additional lag.


TABLE 15.3 The Bayes Information Criterion (BIC) and the $R^2$ for Autoregressive
Models of U.S. GDP Growth Rates, 1962:Q1-2017:Q3

p    SSR(p)/T    ln[SSR(p)/T]    (p + 1) ln(T)/T    BIC(p)    R^2
0    10.477      2.349           0.024              2.373     0.000
1     9.247      2.224           0.048              2.273     0.117
2     8.954      2.192           0.073              2.265     0.145
3     8.954      2.192           0.097              2.289     0.145
4     8.920      2.188           0.121              2.310     0.149
5     8.788      2.173           0.145              2.319     0.161
6     8.779      2.172           0.170              2.342     0.162


The AIC. Another information criterion is the Akaike information criterion (AIC):

$AIC(p) = \ln\left[\frac{SSR(p)}{T}\right] + (p + 1)\frac{2}{T}$. (15.24)


The difference between the AIC and the BIC is that the term ln(T) in the
BIC is replaced by 2 in the AIC, so the second term in the AIC is smaller. For
example, for the 223 observations used to estimate the GDP autoregressions,
ln(T) = ln(223) = 5.41, so the second term for the BIC is more than twice as large
as the term in the AIC. Thus a smaller decrease in the SSR is needed in the AIC to
justify including another lag.
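
Both criteria can be computed directly from the SSR(p)/T column of Table 15.3. The sketch below does so with T = 223; the dictionary and function names are hypothetical. Because the table entries are rounded, results may differ from the published BIC column in the last digit. With these inputs, the BIC is smallest at p = 2, matching the text, and the AIC computed the same way also happens to be smallest at p = 2.

```python
from math import log

T = 223
# SSR(p)/T for p = 0, ..., 6, copied from Table 15.3.
ssr_over_T = {0: 10.477, 1: 9.247, 2: 8.954, 3: 8.954,
              4: 8.920, 5: 8.788, 6: 8.779}

def bic(p):
    """Equation (15.23): ln[SSR(p)/T] + (p + 1) ln(T)/T."""
    return log(ssr_over_T[p]) + (p + 1) * log(T) / T

def aic(p):
    """Equation (15.24): ln[SSR(p)/T] + (p + 1) 2/T."""
    return log(ssr_over_T[p]) + (p + 1) * 2 / T

p_bic = min(ssr_over_T, key=bic)  # lag length minimizing the BIC
p_aic = min(ssr_over_T, key=aic)  # lag length minimizing the AIC

for p in ssr_over_T:
    print(p, round(bic(p), 3), round(aic(p), 3))
print("BIC choice:", p_bic, "AIC choice:", p_aic)
```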

The AIC has an appealing motivation: In large samples, it corresponds to choosing
p to minimize the MSFE as estimated by the final prediction error; that is, it minimizes
$\widehat{MSFE}_{FPE}$ in Equation (15.21).⁵ However, as a matter of theory, the second term
in the AIC is not large enough to ensure that the correct lag length is chosen, even in
large samples, so the AIC estimator of p is not consistent. As is discussed in Appendix 15.5,
in large samples the AIC will overestimate p with nonzero probability.

Both the AIC and the BIC are widely used in practice. If you are concerned that
the BIC might yield a model with too few lags, the AIC provides a reasonable
alternative.⁶

⁵Start with Equation (15.21) to write $\widehat{MSFE}_{FPE} = \frac{1 + (p+1)/T}{1 - (p+1)/T} \cdot \frac{SSR}{T}$.
Taking logarithms of the final expression yields
$\ln(\widehat{MSFE}_{FPE}) = \ln\left(1 + \frac{p+1}{T}\right) - \ln\left(1 - \frac{p+1}{T}\right) + \ln\left(\frac{SSR}{T}\right) \approx 2\left(\frac{p+1}{T}\right) + \ln\left(\frac{SSR}{T}\right)$,
where the final expression uses the approximation that $\ln(1 + x) \approx x$ when x is small
[Equation (8.16)]. The final expression is the AIC in Equation (15.24).
The approximation $\ln(\widehat{MSFE}_{FPE}) \approx AIC$ holds when (p + 1)/T is small.


⁶The BIC and the AIC tackle the same problem (restricting the number of parameters to estimate) as
the penalized least squares methods of ridge regression and the LASSO discussed in Sections 14.3 and
14.4. One difference between the variable selection problem discussed in Chapter 14 and the lag selection
problem discussed here is that, in the general prediction problem with cross-sectional data, there is no
natural ordering of the potential regressors. In contrast, in the lag selection problem, it is natural to think
that the first lag will be the most useful predictor, followed by the second lag, and so forth, so the predictors
have a natural ordering. The AIC and the BIC exploit that natural ordering.

A note on calculating information criteria. For the AIC and BIC to decide between
competing regressions with different numbers of lags, those regressions must be
estimated using the same observations. For example, in Table 15.3 all the regressions
were estimated using data from 1962:Q1 to 2017:Q3, for a total of 223 observations.
Because the autoregressions involve lags of the GDP growth rate, this means that
the regression uses earlier values of GDP growth (values before 1962:Q1) for initial
observations. Said differently, each of the regressions examined in Table 15.3 includes
observations on $GDPGR_t, GDPGR_{t-1}, \ldots, GDPGR_{t-p}$ for t = 1962:Q1, ...,
2017:Q3, corresponding to 223 observations on the dependent variable and regressors,
so T = 223 in Equations (15.23) and (15.24).


Lag Length Selection in Time Series Regression 
with Multiple Predictors 


The trade-off involved with lag length choice in the general time series regression 
model with multiple predictors [Equation (15.18)] is similar to that in an autoregres- 
sion: Using too few lags can decrease forecast accuracy because valuable information 
is lost, but adding lags increases estimation error. The choice of lags must balance the 
benefit of using additional information against the cost of estimating the additional 
coefficients. 


The F-statistic approach. As in the univariate autoregression, one way to deter- 
mine the number of lags is to use F-statistics to test joint hypotheses that sets of 
coefficients are equal to 0. For example, in the discussion of Equation (15.15), we 
tested the hypothesis that the coefficient on the second lag of the term spread was 
equal to 0 against the alternative that it was nonzero; this hypothesis was not 
rejected at the 10% significance level, suggesting that the second lag of the term 
spread could be dropped from the regression. If the number of models being com- 
pared is small, then this F-statistic method is easy to use. In general, however, the 
F-statistic method can produce models that are large and thus have considerable 
estimation error. 


Information criteria. As in an autoregression, the BIC and the AIC can be used to 
estimate the number of lags and variables in the time series regression model with 
multiple predictors. If the regression model has K coefficients (including the inter- 
cept), the BIC is 


$BIC(K) = \ln\left[\frac{SSR(K)}{T}\right] + K\,\frac{\ln(T)}{T}.$ (15.25) 


The AIC is defined in the same way, but with 2 replacing $\ln(T)$ in Equation (15.25). 
For each candidate model, the BIC (or the AIC) can be evaluated, and the model 
with the lowest value of the BIC (or the AIC) is the preferred model, based on the 
information criterion. 

There are two important practical considerations when using an information 
criterion to estimate the lag lengths. First, as is the case for the autoregression, all the 
candidate models must be estimated over the same sample; in the notation of Equa- 
tion (15.25), the number of observations used to estimate the model, T, must be the 
same for all models. Second, when there are multiple predictors, this approach is 
computationally demanding because it requires computing many different models 
(many combinations of the lag parameters). In practice, a convenient shortcut is to 
require all the regressors to have the same number of lags, that is, to require that 
$p = q_1 = \cdots = q_k$, so that only $p_{max} + 1$ models need to be compared (corresponding 
to $p = 0, 1, \ldots, p_{max}$). Applying this lag-length selection method to the ADL for 
GDP growth and the term spread results in the ADL(2, 2) model in Equation (15.15). 
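
The shortcut just described can be sketched in a few lines of code. The following is a minimal illustration, not the procedure used to produce the results in this chapter: the function names (`ols_ssr`, `adl_information_criteria`) are hypothetical, the data are simulated rather than the GDP growth and term spread series, and there is a single predictor. The point it illustrates is that every candidate ADL(p, p) model is estimated on the same $T$ observations before Equation (15.25) is evaluated.

```python
import numpy as np

def ols_ssr(y, X):
    """Sum of squared OLS residuals from regressing y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def adl_information_criteria(y, x, p_max):
    """Return {p: (BIC, AIC)} for ADL(p, p), p = 0, ..., p_max.

    Every model is fit on the same T = len(y) - p_max observations,
    so the criteria are comparable across lag lengths.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    T = n - p_max                      # common sample size
    dep = y[p_max:]                    # dependent variable on the common sample
    out = {}
    for p in range(p_max + 1):
        cols = [np.ones(T)]            # intercept
        for lag in range(1, p + 1):
            cols.append(y[p_max - lag:n - lag])   # Y_{t-lag}
            cols.append(x[p_max - lag:n - lag])   # X_{t-lag}
        X = np.column_stack(cols)
        K = X.shape[1]                 # coefficients, including the intercept
        log_ssr = np.log(ols_ssr(dep, X) / T)
        out[p] = (log_ssr + K * np.log(T) / T,    # BIC, Equation (15.25)
                  log_ssr + K * 2.0 / T)          # AIC: 2 replaces ln(T)
    return out

# Simulated data (an assumption for illustration): the true model is ADL(1, 1).
rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.3 * x[t - 1] + rng.standard_normal()

criteria = adl_information_criteria(y, x, p_max=4)
p_bic = min(criteria, key=lambda p: criteria[p][0])
p_aic = min(criteria, key=lambda p: criteria[p][1])
print("BIC choice:", p_bic, "AIC choice:", p_aic)
```

Because the AIC penalty is smaller than the BIC penalty whenever $\ln(T) > 2$, the AIC never selects fewer lags than the BIC in this nested comparison.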


Nonstationarity I: Trends 


In Key Concept 15.6, it was assumed that the dependent variable and the regressors 
are stationary. If this is not the case—that is, if the dependent variable and/or the 
regressors are nonstationary —then conventional hypothesis tests, confidence inter- 
vals, and forecasts can be unreliable. The precise problem created by nonstationarity, 
and the solution to that problem, depends on the nature of that nonstationarity. 

In this and the next section, we examine two types of nonstationarity that are 
frequently encountered in economic time series: trends and breaks. In each section, 
we first describe the nature of the nonstationarity and then discuss the consequences 
for time series regression if this type of nonstationarity is present but ignored. We 
next present tests for nonstationarity and discuss remedies for, or solutions to, the 
problems caused by that particular type of nonstationarity. We begin by discussing 
trends. 


What Is a Trend? 


A trend is a persistent long-term movement of a variable over time. A time series 
variable fluctuates around its trend. 

Inspection of Figure 15.1a suggests that the logarithm of U.S. GDP has a clear 
upwardly increasing trend. The series in Figures 15.2a, 15.2b, and 15.2c also have 
trends, but their trends are quite different. The trend in the unemployment rate is 
increasing from the late 1960s through the early 1980s, then decreasing until the early 
2000s, and then increasing again. The $/£ exchange rate clearly had a prolonged 
downward trend after the collapse of the fixed exchange rate system in 1972. The 
logarithm of the Japan Industrial Production Index has a complicated trend: fast 
growth at first, then moderate growth, and finally no growth. 


Deterministic and stochastic trends. There are two types of trends in time series 
data: deterministic and stochastic. A deterministic trend is a nonrandom function of 
time. For example, a deterministic trend might be linear in time; if the logarithm of 
U.S. GDP had a deterministic linear trend, so that it increased by 0.75 percentage 
points per quarter, this trend could be written as $0.75t$, where $t$ is measured in quarters. 
In contrast, a stochastic trend is random and varies over time. For example, a 
stochastic trend might exhibit a prolonged period of increase followed by a pro- 
longed period of decrease, like the unemployment rate trend in Figure 15.2a. But 
stochastic trends can be more subtle. For example, if you look carefully at 
Figure 15.1a, you will notice that the trend growth rate of GDP is not constant; for 
example, GDP grew faster in the 1960s than in the 1970s (the plot is steeper in the 
1960s than in the 1970s), and it grew faster in the 1990s than in the 2000s. 

Like many econometricians, we think it is more appropriate to model economic 
time series as having stochastic rather than deterministic trends. It is hard to recon- 
cile the predictability implied by a deterministic trend with the complications and 
surprises faced year after year by workers, businesses, and governments. For example, 
although the U.S. unemployment rate rose through the 1970s, it was neither destined 
to rise forever nor destined to fall again. Rather, the slow rise of unemployment rates 
is now understood to have occurred because of a combination of demographic 
changes (including an influx of younger workers), bad luck (such as oil price shocks 
and a productivity slowdown), and monetary policy mistakes. Similarly, the $/£ 
exchange rate trended down from 1972 to 1985 and subsequently drifted up, but these 
movements, too, were the consequences of complex economic forces; because these 
forces change unpredictably, these trends are usefully thought of as having a large 
unpredictable, or random, component. 

For these reasons, our treatment of trends in economic time series focuses on 
stochastic rather than deterministic trends, and when we refer to “trends” in time 
series data, we mean stochastic trends unless we explicitly say otherwise. 


The random walk model of a trend. The simplest model of a variable with a stochastic 
trend is the random walk. A time series $Y_t$ is said to follow a random walk if the 
change in $Y_t$ is i.i.d., that is, if 


$Y_t = Y_{t-1} + u_t,$ (15.26) 


where $u_t$ is i.i.d. We will, however, use the term random walk more generally to refer 
to a time series that follows Equation (15.26), where $u_t$ has conditional mean 0; 
that is, $E(u_t \mid Y_{t-1}, Y_{t-2}, \ldots) = 0$. Another term for a time series for which 
$E(\Delta Y_t \mid Y_{t-1}, Y_{t-2}, \ldots) = 0$ is a martingale. 

The basic idea of a random walk is that the value of the series tomorrow is 
its value today plus an unpredictable change: Because the path followed by $Y_t$ 
consists of random “steps” $u_t$, that path is a “random walk.” The conditional 
mean of $Y_t$ based on data through time $t - 1$ is $Y_{t-1}$; that is, because 
$E(u_t \mid Y_{t-1}, Y_{t-2}, \ldots) = 0$, $E(Y_t \mid Y_{t-1}, Y_{t-2}, \ldots) = Y_{t-1}$. In other words, if $Y_t$ 
follows a random walk, then the best forecast of tomorrow’s value is its value today. 

If $Y_t$ follows a random walk, its variance increases over time. Because it does not 
have a constant variance, a random walk is nonstationary (Exercise 15.13). 
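
A short derivation makes the growing variance concrete. It is a sketch under two assumptions that are not part of the general definition above: the series starts at $Y_0 = 0$, and the $u_t$ are i.i.d. with variance $\sigma_u^2$ (the symbol $\sigma_u^2$ is introduced here for illustration). Substituting Equation (15.26) into itself repeatedly gives

```latex
Y_t = Y_0 + \sum_{s=1}^{t} u_s = \sum_{s=1}^{t} u_s
\quad\Longrightarrow\quad
\operatorname{var}(Y_t) = \sum_{s=1}^{t} \operatorname{var}(u_s) = t\,\sigma_u^2 .
```

Under these assumptions the variance is proportional to $t$, so it differs from one date to the next, and the distribution of $Y_t$ cannot be the same at all dates.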

Some series, such as the logarithm of U.S. GDP in Figure 15.1a, have an obvious 
upward tendency, in which case the best forecast of the series must include an adjust- 
ment for the tendency of the series to increase. This adjustment leads to an extension 
of the random walk model to include a tendency to move, or drift, in one direction 
or the other. This extension is referred to as a random walk with drift: 


$Y_t = \beta_0 + Y_{t-1} + u_t,$ (15.27) 


where $E(u_t \mid Y_{t-1}, Y_{t-2}, \ldots) = 0$ and $\beta_0$ is the drift in the random walk. If $\beta_0$ is 
positive, then $Y_t$ increases on average. In the random walk with drift model, the best 
forecast of the series tomorrow is the value of the series today plus the drift $\beta_0$. 

The random walk model (with drift, as appropriate) is simple yet versatile, and 
it is the primary model for trends used in this book. 


Stochastic trends, autoregressive models, and a unit root. The random walk model 
is a special case of the AR(1) model [Equation (15.8)] in which $\beta_1 = 1$. In other 
words, if $Y_t$ follows an AR(1) with $\beta_1 = 1$, then $Y_t$ contains a stochastic trend and is 
nonstationary. If, however, $|\beta_1| < 1$ and $u_t$ is stationary, then the joint distribution 
of $Y_t$ and its lags does not depend on $t$ (a result shown in Appendix 15.2), so $Y_t$ is 
stationary. 

The analogous condition for an AR(p) to be stationary is more complicated than 
the condition $|\beta_1| < 1$ for an AR(1). Its formal statement involves the roots of the 
polynomial $1 - \beta_1 z - \beta_2 z^2 - \beta_3 z^3 - \cdots - \beta_p z^p$. (The roots of this polynomial are 
the values of $z$ that satisfy $1 - \beta_1 z - \beta_2 z^2 - \beta_3 z^3 - \cdots - \beta_p z^p = 0$.) For an AR(p) 
to be stationary, the roots of this polynomial must all be greater than 1 in absolute 
value. In the special case of an AR(1), the root is the value of $z$ that solves $1 - \beta_1 z = 0$, 
so its root is $z = 1/\beta_1$. Thus the statement that the root must be greater than 1 in 
absolute value is equivalent to $|\beta_1| < 1$. 
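
The root condition is easy to check numerically. The sketch below is an illustration only: the helper names (`ar_roots`, `is_stationary`) are hypothetical, and the coefficient values are made up for the example. It finds the roots of $1 - \beta_1 z - \cdots - \beta_p z^p$ with NumPy and tests whether all of them exceed 1 in absolute value.

```python
import numpy as np

def ar_roots(betas):
    """Roots of 1 - beta_1 z - beta_2 z^2 - ... - beta_p z^p."""
    # np.roots expects coefficients ordered from the highest power of z
    # down to the constant term.
    coeffs = [-b for b in reversed(betas)] + [1.0]
    return np.roots(coeffs)

def is_stationary(betas):
    """True if every root is greater than 1 in absolute value."""
    return bool(np.all(np.abs(ar_roots(betas)) > 1.0))

# AR(1) with beta_1 = 0.5: the single root is z = 1 / 0.5 = 2.
print(ar_roots([0.5]), is_stationary([0.5]))

# Random walk, beta_1 = 1: the root equals 1 (a unit root).
print(ar_roots([1.0]), is_stationary([1.0]))

# Two illustrative AR(2) examples.
print(is_stationary([0.5, 0.3]))   # both roots lie outside the unit circle
print(is_stationary([0.7, 0.5]))   # one root lies inside the unit circle
```

Roots are computed in floating point, so a higher-order model with a root exactly equal to 1 may return a value marginally above or below 1. In applied work the unit root hypothesis is tested statistically, as described later in this section, rather than read off the estimated roots.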

If an AR(p) has a root that equals 1, the series is said to have a unit autoregressive 
root or, more simply, a unit root. If $Y_t$ has a unit root, then it contains a stochastic 
trend. If $Y_t$ is stationary (and thus does not have a unit root), it does not contain a 
stochastic trend. For this reason, we will use the terms stochastic trend and unit root 
interchangeably. 


Problems Caused by Stochastic Trends 


If a regressor has a stochastic trend (that is, has a unit root), then inferences made 
using the OLS estimator of the autoregressive coefficient can be misleading. More- 
over, two series that are independent but have stochastic trends will, with high 
probability, misleadingly appear to be related, a situation known as spurious 
regression. 

Downward bias and nonnormal distributions of the OLS estimator and t-statistic. If 
a regressor has a stochastic trend, then its usual OLS t-statistic can have a nonnormal 
distribution under the null hypothesis, even in large samples, and the estimate of the 
autoregressive coefficient is biased toward 0. This nonnormal distribution means that 
conventional confidence intervals are not valid and hypothesis tests cannot be con- 
ducted as usual. 

The downward bias of the OLS estimator poses a problem for forecasts. Recall 
that the oracle forecast is the conditional mean. If the coefficient in an AR(1) model 
of the conditional mean is 1 (a unit root), then the OLS estimator will tend to take 
on a value less than 1, and its sampling distribution has a mean that is less than 1. In 
a forecasting application, this can lead to systematic bias in the forecast. Moreover, 
because the distribution of the t-statistic testing that coefficient is not normal, even 
in large samples, standard inferences based on that t-statistic will not detect this mis- 
take of downward-biased forecasts. Fortunately, as is discussed later in this section, 
there are ways to detect whether a series has a unit root and thus to avoid these 
problems. 
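
A small Monte Carlo experiment illustrates the downward bias. This is a sketch under assumed settings (random walks without drift, standard normal errors, 100 observations per series, 2000 replications), and the function name `ar1_ols_slope` is hypothetical. Each replication simulates a random walk, so the true AR(1) coefficient is 1, and records the OLS estimate from a regression of $Y_t$ on an intercept and $Y_{t-1}$.

```python
import numpy as np

def ar1_ols_slope(y):
    """OLS estimate of beta_1 from regressing Y_t on an intercept and Y_{t-1}."""
    y_lag, y_cur = y[:-1], y[1:]
    x = y_lag - y_lag.mean()
    return float(x @ (y_cur - y_cur.mean()) / (x @ x))

rng = np.random.default_rng(1)
T, n_sims = 100, 2000

# Each simulated series is a random walk: the cumulative sum of i.i.d. shocks.
slopes = np.array([
    ar1_ols_slope(np.cumsum(rng.standard_normal(T)))
    for _ in range(n_sims)
])

mean_slope = slopes.mean()                 # average estimate across replications
share_below_one = (slopes < 1.0).mean()    # fraction of estimates below the true value
print("mean estimate:", mean_slope)
print("share of estimates below 1:", share_below_one)
```

In runs of this kind, the average estimate falls below 1 and most individual estimates do as well, which is the downward bias described above.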


Spurious regression. Stochastic trends can lead two time series to appear related 
when they are not, a problem called spurious regression. 

For example, the U.S. unemployment rate was steadily rising from the mid-1960s 
through the early 1980s, and at the same time, Japanese industrial production (plotted 
in logarithms in Figure 15.2c) was steadily rising. These two trends conspire to 
produce a regression that appears to be “significant” using conventional measures. 
Estimated by OLS using data from 1962 through 1985, this regression is 


$\widehat{\text{U.S. Unemployment Rate}}_t = \underset{(1.19)}{-2.37} + \underset{(0.32)}{2.22} \times \ln(\text{Japanese IP}_t), \quad \bar{R}^2 = 0.34.$ (15.28) 


The t-statistic on the slope coefficient is 7, which by usual standards indicates a strong 
positive relationship between the two series, and the $\bar{R}^2$ is moderately high. However, 
running this regression using data from 1986 through 2017 yields 


$\widehat{\text{U.S. Unemployment Rate}}_t = \underset{(7.74)}{42.37} - \underset{(1.69)}{7.92} \times \ln(\text{Japanese IP}_t), \quad \bar{R}^2 = 0.14.$ (15.29) 


The regressions in Equations (15.28) and (15.29) could hardly be more different. 
Interpreted literally, Equation (15.28) indicates a strong positive relationship, while 
Equation (15.29) indicates a negative relationship. 

The source of these conflicting results is that both series have stochastic trends. 
These trends happened to align from 1962 through 1985 but were reversed from 1986 
through 2017. There is, in fact, no compelling economic or political reason to think that 
the trends in these two series are related. In short, these regressions are spurious. 

The regressions in Equations (15.28) and (15.29) illustrate empirically the 
theoretical point that OLS can be misleading when the series contain stochastic 
trends. (See Exercise 15.6 for a computer simulation that demonstrates this result.) 
One special case in which certain regression-based methods are reliable is when the 
trend component of the two series is the same—that is, when the series contain a 
common stochastic trend; in such a case, the series are said to be cointegrated. Econo- 
metric methods for detecting and analyzing cointegrated economic time series are 
discussed in Chapter 17. 
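
The spurious regression problem can be reproduced by simulation. The sketch below is an illustration under assumed settings (100 observations, 1000 replications, standard normal shocks, homoskedasticity-only standard errors); the function names are hypothetical. It regresses one series on another, independent series and counts how often the usual t-test rejects a zero slope at the 5% level, first for independent random walks and then for independent i.i.d. series.

```python
import numpy as np

def slope_t_stat(y, x):
    """Homoskedasticity-only t-statistic on the slope in a regression of y on x."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)
    var_slope = s2 * np.linalg.inv(X.T @ X)[1, 1]
    return float(beta[1] / np.sqrt(var_slope))

def rejection_rate(random_walk, rng, T=100, n_sims=1000):
    """Share of replications in which |t| > 1.96 for two independent series."""
    count = 0
    for _ in range(n_sims):
        y, x = rng.standard_normal(T), rng.standard_normal(T)
        if random_walk:
            # Cumulating the shocks gives each series a stochastic trend.
            y, x = np.cumsum(y), np.cumsum(x)
        count += abs(slope_t_stat(y, x)) > 1.96
    return count / n_sims

rng = np.random.default_rng(2)
rw_rate = rejection_rate(True, rng)
iid_rate = rejection_rate(False, rng)
print("rejection rate, independent random walks:", rw_rate)
print("rejection rate, independent i.i.d. series:", iid_rate)
```

With i.i.d. series the rejection rate should be close to the nominal 5%, whereas with independent random walks the test rejects far more often, even though the two series are unrelated by construction.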


Detecting Stochastic Trends: Testing for a Unit AR Root 


The starting point for detecting a trend in a time series is inspecting its time series 
plot. If the series looks like it might have a trend, the hypothesis that it has a stochas- 
tic trend can be tested using a Dickey—Fuller test. 


The Dickey-Fuller test in the AR(1) model. The random walk in Equation (15.27) is 
a special case of the AR(1) model with $\beta_1 = 1$. Thus, when $Y_t$ follows an AR(1), the 
hypothesis that $Y_t$ has a stochastic trend corresponds to 


$H_0: \beta_1 = 1$ vs. $H_1: \beta_1 < 1$, where $Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t.$ (15.30) 


The null hypothesis in Equation (15.30) is that the AR(1) has a unit root, and the 
one-sided alternative is that it is stationary. 

This test is most easily implemented by estimating a modified version of Equation 
(15.30), obtained by subtracting $Y_{t-1}$ from both sides. Let $\delta = \beta_1 - 1$; then 
Equation (15.30) becomes 


$H_0: \delta = 0$ vs. $H_1: \delta < 0$, where $\Delta Y_t = \beta_0 + \delta Y_{t-1} + u_t.$ (15.31) 


The OLS t-statistic testing $\delta = 0$ in Equation (15.31) is called the Dickey-Fuller 
statistic [Dickey and Fuller (1979)]. The Dickey-Fuller statistic is computed using 
nonrobust standard errors, that is, the homoskedasticity-only standard errors 
presented in Appendix 5.1.⁷ 


Critical values for the ADF statistic. Under the null hypothesis of a unit root, 
the Dickey-Fuller statistic does not have a normal distribution, even in large 
samples. Because its distribution is nonnormal, a different set of critical values 
is required. 

The critical values for the ADF test of the null and alternative hypotheses in 
Equation (15.31) are given in the first row of Table 15.4. Because the alternative 
hypothesis of stationarity implies that $\delta < 0$ in Equation (15.31), the ADF test is 
one-sided. For example, if the regression does not include a time trend, then the 
hypothesis of a unit root is rejected at the 5% significance level if the ADF statistic 
is less than $-2.86$. 


⁷Under the null hypothesis of a unit root, the usual nonrobust standard errors produce a t-statistic that is, 
in fact, robust to heteroskedasticity, a surprising and special result. 

TABLE 15.4 Large-Sample Critical Values of the Augmented Dickey-Fuller Statistic 


Deterministic Regressors      10%      5%       1% 
Intercept only               -2.57    -2.86    -3.43 
Intercept and time trend     -3.12    -3.41    -3.96 


The critical values in Table 15.4 are substantially larger (more negative) than the 
one-sided critical values of $-1.28$ (at the 10% level) and $-1.64$ (at the 5% level) from 
the standard normal distribution. The nonstandard distribution of the ADF statistic 
is an example of how OLS t-statistics for regressors with stochastic trends can have 
nonnormal distributions. 


The Dickey-Fuller test in the AR(p) model. The Dickey—Fuller statistic in Equation (15.31) 
applies to first-order autoregression. The extension of the Dickey—Fuller test to the 
AR(p) model entails including $p - 1$ lags of $\Delta Y_t$ as additional regressors, so that 
Equation (15.31) becomes 


$\Delta Y_t = \beta_0 + \delta Y_{t-1} + \gamma_1 \Delta Y_{t-1} + \gamma_2 \Delta Y_{t-2} + \cdots + \gamma_{p-1} \Delta Y_{t-p+1} + u_t.$ (15.32) 


Under the null hypothesis that $\delta = 0$, $Y_t$ has a stochastic trend; under the alternative 
hypothesis that $\delta < 0$, $Y_t$ is stationary. The t-statistic testing the hypothesis that 
$\delta = 0$ in Equation (15.32) is called the augmented Dickey-Fuller (ADF) statistic. In 
general, the lag length $p$ is unknown, but it can be estimated using an information 
criterion applied to regressions of the form in Equation (15.32) for various values of $p$. 
Studies of the ADF statistic suggest that it is better to have too many lags than too 
few, so it is recommended to use the AIC instead of the BIC to estimate $p$ for the 
ADF statistic.⁸ 


Testing against the alternative of stationarity around a linear deterministic 
time trend. The discussion so far has considered the null hypothesis that a series has 
a unit root and the alternative hypothesis that it is stationary. This alternative hypoth- 
esis of stationarity is appropriate for series such as the unemployment rate that do 
not exhibit growth over the long run. But for series such as U.S. GDP, the alternative 
of stationarity around a constant mean is inappropriate, and it makes more sense to 
test for stationarity around a deterministic trend. One specific formulation of this 
alternative hypothesis is that the trend is a linear function of t. Thus the null hypoth- 
esis is that the series has a unit root, and the alternative is that it does not have a unit 
root but does have a deterministic time trend. 


⁸See Stock (1994) and Haldrup and Jansson (2006) for reviews of simulation studies of the finite-sample 
properties of the Dickey—Fuller and other unit root test statistics. 

If the alternative hypothesis is that $Y_t$ is stationary around a deterministic linear 
time trend, then this trend, $t$ (the observation number), must be added as an additional 
regressor, in which case the Dickey-Fuller regression becomes 


$\Delta Y_t = \beta_0 + \alpha t + \delta Y_{t-1} + \gamma_1 \Delta Y_{t-1} + \gamma_2 \Delta Y_{t-2} + \cdots + \gamma_{p-1} \Delta Y_{t-p+1} + u_t,$ (15.33) 


where $\alpha$ is an unknown coefficient. The ADF statistic now is the OLS t-statistic testing 
$\delta = 0$ in Equation (15.33), and the one-sided critical values are given in the second 
row of Table 15.4.⁹ 
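
The ADF regressions in Equations (15.32) and (15.33) can be estimated by OLS directly. The sketch below is illustrative only: the function name `adf_statistic` and its arguments are hypothetical, the data are simulated, and the lag length is fixed at $p = 3$ rather than chosen by the AIC. It uses homoskedasticity-only standard errors, as the text specifies, and returns the t-statistic on $Y_{t-1}$, which is then compared with the critical values in Table 15.4.

```python
import numpy as np

def adf_statistic(y, p, trend=False):
    """ADF t-statistic on delta in Equation (15.32), or (15.33) if trend=True.

    Includes p - 1 lagged differences and uses homoskedasticity-only
    standard errors. Assumes p >= 1.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                    # dy[i] = Y_{i+1} - Y_i
    k = p - 1                          # number of lagged differences
    n = len(dy) - k                    # usable observations
    cols = [np.ones(n)]                # intercept
    if trend:
        cols.append(np.arange(1, n + 1, dtype=float))   # linear time trend
    cols.append(y[k:-1])               # Y_{t-1}
    delta_idx = len(cols) - 1          # column position of Y_{t-1}
    for j in range(1, k + 1):
        cols.append(dy[k - j:len(dy) - j])               # Delta Y_{t-j}
    X = np.column_stack(cols)
    dep = dy[k:]                       # Delta Y_t
    beta, *_ = np.linalg.lstsq(X, dep, rcond=None)
    resid = dep - X @ beta
    s2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[delta_idx, delta_idx])
    return float(beta[delta_idx] / se)

# Simulated examples (assumptions for illustration): a random walk and a
# stationary AR(1) with coefficient 0.5, built from the same shocks.
rng = np.random.default_rng(3)
e = rng.standard_normal(500)
random_walk = np.cumsum(e)
stationary = np.zeros(500)
for t in range(1, 500):
    stationary[t] = 0.5 * stationary[t - 1] + e[t]

adf_rw = adf_statistic(random_walk, p=3)
adf_st = adf_statistic(stationary, p=3)
adf_st_trend = adf_statistic(stationary, p=3, trend=True)
print("random walk, intercept only:", adf_rw)
print("stationary AR(1), intercept only:", adf_st)
print("stationary AR(1), intercept and trend:", adf_st_trend)
```

The unit root null is rejected at the 5% level when the statistic is less than the Table 15.4 critical value: $-2.86$ with an intercept only, or $-3.41$ with an intercept and time trend.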


Does U.S. GDP have a stochastic trend? The null hypothesis that the logarithm of 
U.S. GDP has a stochastic trend can be tested against the alternative that it is station- 
ary by performing the ADF test for a unit autoregressive root. The ADF regression 
with two lags of Aln(GDP,) is 


$\widehat{\Delta \ln(GDP_t)} = \underset{(0.080)}{0.162} + \underset{(0.0001)}{0.0001}\,t - \underset{(0.010)}{0.019}\,\ln(GDP_{t-1}) + \underset{(0.066)}{0.261}\,\Delta\ln(GDP_{t-1}) + \underset{(0.066)}{0.165}\,\Delta\ln(GDP_{t-2}).$ (15.34) 


The ADF t-statistic is the t-statistic testing the hypothesis that the coefficient on $\ln(GDP_{t-1})$ 
is 0; this is $t = -1.95$. From Table 15.4, the 10% critical value is $-3.12$. Because the ADF 
statistic of $-1.95$ is less negative than $-3.12$, the test does not reject the null hypothesis at 
the 10% significance level. Based on the regression in Equation (15.34), we therefore cannot 
reject (at the 10% significance level) the null hypothesis that the logarithm of GDP 
has a unit autoregressive root (that is, that $\ln(GDP)$ has a stochastic trend) against the 
alternative that it is stationary around a linear trend. 


Avoiding the Problems Caused by Stochastic Trends 


The most reliable way to handle a trend in a series is to transform the series so that 
it does not have the trend. If the series has a stochastic trend, then its difference does 
not. For example, if $Y_t$ follows a random walk, so that $Y_t = \beta_0 + Y_{t-1} + u_t$, then 
$\Delta Y_t = \beta_0 + u_t$ is stationary. Thus using first differences eliminates random walk 
trends in a series. 

In practice, you can rarely be sure whether a series has a stochastic trend. Recall 
that, as a general point, failure to reject the null hypothesis does not necessarily mean 
that the null hypothesis is true; rather, it simply means that you have insufficient evi- 
dence to conclude that it is false. Thus failure to reject the null hypothesis of a unit root 
using the ADF statistic does not mean that the series actually has a unit root. Even 
though failure to reject the null hypothesis of a unit root does not mean the series has 


⁹For extensions of the Dickey-Fuller test to nonlinear time trends, see Maddala and Kim (1998). 
a unit root, it still can be reasonable to approximate the true autoregressive root as 
equaling 1 and therefore to use differences of the series rather than its levels.¹⁰ 


Nonstationarity II: Breaks 


A second type of nonstationarity arises when the population regression function 
changes over the course of the sample. In economics, this can occur for a variety of 
reasons, such as changes in economic policy, changes in the structure of the economy, 
or changes in a specific industry due to an invention. If such changes, or breaks, occur, 
then a regression model that neglects those changes can provide a misleading basis 
for inference and forecasting. It is therefore important to check a forecasting model 
for breaks and to adjust the model if one is found. 


What Is a Break? 


Breaks can arise either from a discrete change in the population regression coefficients at 
a distinct date or from a gradual evolution of the coefficients over a longer period of time. 

One source of discrete breaks in macroeconomic data is a major change in mac- 
roeconomic policy. For example, the breakdown of the Bretton Woods system of 
fixed exchange rates in 1972 produced the break in the time series behavior of the 
$/£ exchange rate that is evident in Figure 15.2b. Prior to 1972, the exchange rate was 
essentially constant, with the exception of a single devaluation in 1968, when the 
official value of the pound relative to the dollar was decreased. In contrast, since 1972 
the exchange rate has fluctuated over a very wide range. 

Breaks also can occur more slowly, as the population regression evolves over 
time. For example, such changes can arise because of slow evolution of economic 
policy and ongoing changes in the structure of the economy. The methods for detect- 
ing breaks described in this section can detect both types of breaks: distinct changes 
and slow evolution. 


Problems caused by breaks. If a break occurs in the population regression function 
during the sample, then the OLS regression estimates over the full sample will esti- 
mate a relationship that holds on average in the sense that the estimate combines the 
two different periods. Depending on the location and the size of the break, the “aver- 
age” regression function can be quite different from the true regression function at 
the end of the sample, and this leads to poor forecasts. 


Testing for Breaks 

One way to detect breaks is to test for discrete changes, or breaks, in the regression 
coefficients. How this is done depends on whether the break date (the date of the 
suspected break) is known. 


¹⁰For additional discussion of stochastic trends in economic time series variables and of the problems they 
pose for regression analysis, see Stock and Watson (1988). 

Testing for a break at a known date. In some applications, you might suspect that 
there is a break at a known date. For example, if you are studying international trade 
relationships using data from the 1970s, you might hypothesize that there is a break 
in the population regression function of interest in 1972, when the Bretton Woods 
system of fixed exchange rates was abandoned in favor of floating exchange rates. 

If the date of the hypothesized break in the coefficients is known, then the null 
hypothesis of no break can be tested using a binary variable interaction regression 
(Key Concept 8.4). To keep things simple, consider an ADL(1, 1) model, so there is 
an intercept, a single lag of $Y_t$, and a single lag of $X_t$. Let $\tau$ denote the hypothesized 
break date, and let $D_t(\tau)$ be a binary variable that equals 0 before the break date and 
1 after, so $D_t(\tau) = 0$ if $t \le \tau$ and $D_t(\tau) = 1$ if $t > \tau$. Then the regression including 
the binary break indicator and all interaction terms is 


$Y_t = \beta_0 + \beta_1 Y_{t-1} + \delta_1 X_{t-1} + \gamma_0 D_t(\tau) + \gamma_1 [D_t(\tau) \times Y_{t-1}] + \gamma_2 [D_t(\tau) \times X_{t-1}] + u_t.$ (15.35) 


If there is not a break, then the population regression function is the same over both 
parts of the sample, so the terms involving the break binary variable $D_t(\tau)$ do not 
enter Equation (15.35). That is, under the null hypothesis of no break, 
$\gamma_0 = \gamma_1 = \gamma_2 = 0$. Under the alternative hypothesis that there is a break, the population 
regression function is different before and after the break date $\tau$, in which case 
at least one of the $\gamma$’s is nonzero. Thus the hypothesis of a break can be tested using 
the F-statistic that tests the hypothesis that $\gamma_0 = \gamma_1 = \gamma_2 = 0$ against the hypothesis 
that at least one of the $\gamma$’s is nonzero. This is often called a Chow test for a break at 
a known break date, named for its inventor, Gregory Chow (1960). 

If there are multiple predictors or more lags, then this test can be extended by 
constructing binary variable interaction variables for all the regressors and testing the 
hypothesis that all the coefficients on terms involving $D_t(\tau)$ are 0. 

This approach can be modified to check for a break in a subset of the coefficients by 
including only the binary variable interactions for that subset of regressors of interest. 
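
The Chow test for the ADL(1, 1) regression in Equation (15.35) can be sketched as follows. Several things here are assumptions made for illustration: the function names (`ssr`, `chow_f`) are hypothetical, the data are simulated with a shift in the intercept after date 100, and the F-statistic is the homoskedasticity-only version computed from restricted and unrestricted sums of squared residuals. A robust F-statistic would require a different variance estimator.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def chow_f(y, x, tau):
    """Homoskedasticity-only F-statistic for gamma_0 = gamma_1 = gamma_2 = 0
    in Equation (15.35), with a hypothesized break after date tau."""
    dep = y[1:]                               # Y_t for t = 1, ..., n - 1
    t_index = np.arange(1, len(y))            # date of each observation
    D = (t_index > tau).astype(float)         # D_t(tau): 0 if t <= tau, 1 if t > tau
    X_r = np.column_stack([np.ones(len(dep)), y[:-1], x[:-1]])   # no break
    X_u = np.column_stack([X_r, D, D * y[:-1], D * x[:-1]])      # with break terms
    q = 3                                     # number of restrictions
    ssr_r, ssr_u = ssr(dep, X_r), ssr(dep, X_u)
    return ((ssr_r - ssr_u) / q) / (ssr_u / (len(dep) - X_u.shape[1]))

# Simulated data: one series with an intercept shift after t = 100, one without.
rng = np.random.default_rng(4)
n, tau = 200, 100
x = rng.standard_normal(n)
y_break = np.zeros(n)
y_stable = np.zeros(n)
for t in range(1, n):
    shock = rng.standard_normal()
    shift = 3.0 if t > tau else 0.0
    y_break[t] = shift + 0.3 * y_break[t - 1] + 0.5 * x[t - 1] + shock
    y_stable[t] = 0.3 * y_stable[t - 1] + 0.5 * x[t - 1] + shock

f_break = chow_f(y_break, x, tau)
f_stable = chow_f(y_stable, x, tau)
print("F-statistic, series with a break:", f_break)
print("F-statistic, series without a break:", f_stable)
```

To test for a break in only a subset of the coefficients, keep only the corresponding interaction columns in `X_u` and set `q` to their number.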


Testing for a break at an unknown date. Often the date of a possible break is 
unknown or known only within a range. Suppose, for example, you suspect that a 
break occurred sometime between two dates, $\tau_0$ and $\tau_1$. The Chow test can be 
extended to handle this situation by testing for breaks at all possible dates $\tau$ between 
$\tau_0$ and $\tau_1$ and then using the largest of the resulting F-statistics to test for a break at 
an unknown date. This modified Chow test is variously called the Quandt likelihood 
ratio (QLR) statistic (Quandt 1960) (the term we shall use) or, more obscurely, the 
sup-Wald statistic. 

Because the QLR statistic is the largest of many F-statistics, its distribution is not 
the same as an individual F-statistic. Instead, the critical values for the QLR statistic 
must be obtained from a special distribution. Like the F-statistic, this distribution 
depends on the number of restrictions being tested, $q$, that is, the number of coefficients 
(including the intercept) that are being allowed to break, or change, under 
the alternative hypothesis. The distribution of the QLR statistic also depends on 
$\tau_0/T$ and $\tau_1/T$, that is, on the endpoints, $\tau_0$ and $\tau_1$, of the subsample over which the 
F-statistics are computed, expressed as a fraction of the total sample size. 

For the large-sample approximation to the distribution of the QLR statistic to 
be a good one, the subsample endpoints, $\tau_0$ and $\tau_1$, cannot be too close to the beginning 
or the end of the sample. For this reason, in practice the QLR statistic is computed 
over a “trimmed” range, or subset, of the sample. A common choice is to use 
15% trimming, that is, to set $\tau_0 = 0.15T$ and $\tau_1 = 0.85T$ (rounded to the nearest 
integer). With 15% trimming, the F-statistic is computed for break dates in the central 
70% of the sample. 

The critical values for the QLR statistic, computed with 15% trimming, are given 
in Table 15.5. Comparing these critical values with those of the $F_{q,\infty}$ distribution 
(Appendix Table 4) shows that the critical values for the QLR statistics are larger. 


TABLE 15.5 Critical Values of the QLR Statistic with 15% Trimming 

Number of Restrictions (q)    10%      5%       1% 
 1                            7.12     8.68    12.16 
 2                            5.00     5.86     7.18 
 3                            4.09     4.71     6.02 
 4                            3.59     4.09     5.12 
 5                            3.26     3.66     4.53 
 6                            3.02     3.37     4.12 
 7                            2.84     3.15     3.82 
 8                            2.69     2.98     3.57 
 9                            2.58     2.84     3.38 
10                            2.48     2.71     3.23 
11                            2.40     2.62     3.09 
12                            2.33     2.54     2.97 
13                            2.27     2.46     2.87 
14                            2.21     2.40     2.78 
15                            2.16     2.34     2.71 
16                            2.12     2.29     2.64 
17                            2.08     2.25     2.58 
18                            2.05     2.20     2.53 
19                            2.01     2.17     2.48 
20                            1.99     2.13     2.43 

Note: These critical values apply when $\tau_0 = 0.15T$ and $\tau_1 = 0.85T$ (rounded to the nearest integer), so the 
F-statistic is computed for all potential break dates in the central 70% of the sample. The number of restrictions 
$q$ is the number of restrictions tested by each individual F-statistic. Critical values for other trimming percentages 
are given in Andrews (2003). 

KEY CONCEPT 15.8: The QLR Test for Coefficient Stability 


Let $F(\tau)$ denote the F-statistic testing the hypothesis of a break in the regression 
coefficients at date $\tau$; in the regression in Equation (15.35), for example, this is 
the F-statistic testing the null hypothesis that $\gamma_0 = \gamma_1 = \gamma_2 = 0$. The QLR (or 
sup-Wald) test statistic is the largest of the F-statistics in the range $\tau_0 \le \tau \le \tau_1$: 


$QLR = \max[F(\tau_0), F(\tau_0 + 1), \ldots, F(\tau_1)].$ (15.36) 


1. Like the F-statistic, the QLR statistic can be used to test for a break in all or 
just some of the regression coefficients. 


2. In large samples, the distribution of the QLR statistic under the null hypothesis 
depends on the number of restrictions being tested, $q$, and on the endpoints 
$\tau_0$ and $\tau_1$ as a fraction of $T$. Critical values are given in Table 15.5 for 
15% trimming ($\tau_0 = 0.15T$ and $\tau_1 = 0.85T$, rounded to the nearest integer). 


3. The QLR test can detect a single discrete break, multiple discrete breaks, 
and/or slow evolution of the regression function. 


4. If there is a distinct break in the regression function, the date at which the 
largest Chow statistic occurs is an estimator of the break date. 


This reflects the fact that the QLR statistic looks at the largest of many individual 
F-statistics. By examining F-statistics at many possible break dates, the QLR statistic 
has many opportunities to reject the null hypothesis, leading to QLR critical values 
that are larger than the individual F-statistic critical values. 

The QLR test can be used to test for a break in only some of the regression 
coefficients by using interactions between the date binary indicators and only 
the variables in question, and then computing the largest of the resulting 
F-statistics. The critical values for this version of the QLR test are also taken 
from Table 15.5, where the number of restrictions (q) is the number of restric- 
tions tested. 

If there is a discrete break at a date within the range tested, the date at which the 
constituent F-statistic is at its maximum, $\hat{\tau}$, is an estimate of the break date $\tau$. 

The QLR statistic also rejects the null hypothesis with high probability in large 
samples when there are multiple discrete breaks or when the break comes in the 
form of a slow evolution of the regression function. This means that the QLR statistic 
detects forms of instability other than a single discrete break. As a result, if the QLR 
statistic rejects the null hypothesis, it can mean that there is a single discrete break, 
that there are multiple discrete breaks, or that there is slow evolution of the regres- 
sion function. 

The QLR statistic is summarized in Key Concept 15.8. 
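
The calculation in Equation (15.36) can be sketched in code. As before, this is an illustration under stated assumptions rather than a definitive implementation: the function names (`ssr`, `qlr_statistic`) are hypothetical, the data are simulated with an intercept shift after observation 120, all coefficients are allowed to break (so $q$ equals the number of regressors including the intercept), and each F-statistic is the homoskedasticity-only version.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def qlr_statistic(y, X, trim=0.15):
    """Largest break F-statistic over the trimmed range, and its date.

    X must include a column of ones. All coefficients may break, so each
    F-statistic tests q = X.shape[1] restrictions.
    Returns (QLR, estimated break date, dict of F-statistics by date).
    """
    T, q = X.shape
    tau0 = int(round(trim * T))               # first candidate break date
    tau1 = int(round((1.0 - trim) * T))       # last candidate break date
    t_index = np.arange(1, T + 1)
    ssr_r = ssr(y, X)                         # restricted model: no break
    f_stats = {}
    for tau in range(tau0, tau1 + 1):
        D = (t_index > tau).astype(float)     # D_t(tau)
        X_u = np.column_stack([X, X * D[:, None]])   # add all interactions
        ssr_u = ssr(y, X_u)
        f_stats[tau] = ((ssr_r - ssr_u) / q) / (ssr_u / (T - 2 * q))
    tau_hat = max(f_stats, key=f_stats.get)   # date of the largest F-statistic
    return f_stats[tau_hat], tau_hat, f_stats

# Simulated data (an assumption for illustration): intercept shifts after t = 120.
rng = np.random.default_rng(5)
T, true_break = 200, 120
x = rng.standard_normal(T)
t_index = np.arange(1, T + 1)
y = 1.0 + 0.5 * x + 3.0 * (t_index > true_break) + rng.standard_normal(T)
X = np.column_stack([np.ones(T), x])

qlr, tau_hat, f_stats = qlr_statistic(y, X)
print("QLR statistic:", qlr)
print("estimated break date:", tau_hat)
```

Here $q = 2$, so the statistic is compared with the second row of Table 15.5 rather than with the usual F critical values.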


Warning: You probably don’t know the break date even if you think you 
do. Sometimes an expert might believe that he or she knows the date of a possible 
break, so that the Chow test can be used instead of the QLR test. But if this knowl- 
edge is based on the expert’s knowledge of the series being analyzed, then, in fact, 
this date was estimated using the data, albeit in an informal way. Preliminary estima- 
tion of the break date means that the usual F critical values cannot be used for the 
Chow test for a break at that date. Thus it remains appropriate to use the QLR sta- 
tistic in this circumstance. 


Application: Has the predictive power of the term spread been stable? The QLR test 
provides a way to check whether the GDP-term spread relation has been stable from 
1962 to 2017. Specifically, we focus on whether there have been changes in the 
coefficients on the lagged values of the term spread and the intercept in the ADL(2,2) 
specification in Equation (15.15), containing two lags each of $GDPGR_t$ and $TSpread_t$.

The Chow F-statistics testing the hypothesis that the intercept and the coefficients
on $TSpread_{t-1}$ and $TSpread_{t-2}$ in Equation (15.15) are constant
against the alternative that they break at a given date are plotted in Figure 15.5 for 
breaks in the central 70% of the sample. For example, the F-statistic testing for a 
break in 1975:Q1 is 2.07, the value plotted at that date in the figure. Each F-statistic 
tests three restrictions (no change in the intercept and in the two coefficients on lags 
of the term spread), so q = 3. The largest of these F-statistics is 6.47, which occurs in 
1980:Q4; this is the QLR statistic. Comparing 6.47 to the critical values for q = 3 in 
Table 15.5 indicates that the hypothesis that these coefficients are stable is rejected 
at the 1% significance level. (The 1% critical value is 6.02.) Thus, there is statistically 
significant evidence that at least one of these coefficients changed over the sample. 


FIGURE 15.5 F-Statistics Testing for a Break in Equation (15.15) at Different Dates

At a given break date, the F-statistic plotted here tests the null hypothesis of no break against the
alternative of a break in at least one of the coefficients on $TSpread_{t-1}$, $TSpread_{t-2}$, or the
intercept in Equation (15.15). For example, the F-statistic testing for a break in 1975:Q1 is 2.07. The
QLR statistic, 6.47, is the largest of these F-statistics and exceeds the 1% critical value of 6.02.
(Horizontal axis: break date, 1960 to 2017. The plot marks the QLR statistic and the 1% and 5% critical values.)

Detecting Breaks Using Pseudo Out-of-Sample Forecasts


The ultimate test of a forecasting model is its out-of-sample performance — that is, its 
forecasting performance in “real time,” after the model has been estimated. Pseudo 
out-of-sample forecasting, introduced in Key Concept 15.7 for the purpose of 
estimating the MSFE, simulates the real-time performance of a forecasting model 
and can be used to detect breaks near the end of the sample. 

The most direct and often most useful way to do so is via a time series plot of the 
in-sample predicted values, the pseudo out-of-sample forecasts, and the actual values 
of the series. A visible deterioration of the forecasts in the pseudo out-of-sample 
period is a red flag warning of a possible breakdown of the forecasting model. 

Another check is to compare $\widehat{MSFE}_{POOS}$ with $\widehat{MSFE}_{FPE}$, where $\widehat{MSFE}_{FPE}$ is
computed on the same estimation sample as used for $\widehat{MSFE}_{POOS}$ (the first T - P
observations). If the series is stationary, these two estimates of the MSFE should be
numerically close. A value of $\widehat{MSFE}_{POOS}$ that is much larger than $\widehat{MSFE}_{FPE}$ suggests
some violation of stationarity, possibly a breakdown of the forecasting equation.
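
The comparison just described can be illustrated with a short Python sketch on simulated data. Several things here are assumptions rather than the text's own procedure: the final prediction error is taken to have the form ((n + p + 1)/(n - p - 1)) x SSR/n, which should be checked against the chapter's definition, the pseudo out-of-sample forecasts re-estimate the model on an expanding window, and the helper names are hypothetical.

```python
import numpy as np

def fit_ar(y, p):
    """OLS fit of an AR(p) with intercept; returns (coefficients, SSR, n)."""
    T = len(y)
    # Columns: constant, y_{t-1}, ..., y_{t-p} for t = p, ..., T-1.
    cols = [np.ones(T - p)] + [y[p - j:T - j] for j in range(1, p + 1)]
    X, Y = np.column_stack(cols), y[p:]
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    return beta, float(resid @ resid), len(Y)

def msfe_fpe(y, p):
    """Final prediction error estimate of the MSFE (assumed form)."""
    _, ssr, n = fit_ar(y, p)
    return (n + p + 1) / (n - p - 1) * ssr / n

def msfe_poos(y, p, P):
    """MSFE from one-step pseudo out-of-sample forecasts of the last P dates."""
    T = len(y)
    errors = []
    for s in range(T - P, T):
        beta, _, _ = fit_ar(y[:s], p)            # data through date s-1 only
        x_s = np.concatenate(([1.0], y[s - p:s][::-1]))
        errors.append(y[s] - x_s @ beta)
    errors = np.asarray(errors)
    return float(np.mean(errors ** 2)), errors

# Simulated stationary AR(1) with unit error variance.
rng = np.random.default_rng(1)
T, P = 400, 100
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + rng.normal()

fpe = msfe_fpe(y[:T - P], p=1)    # same estimation sample: first T - P obs
poos, poos_errors = msfe_poos(y, p=1, P=P)
print(f"MSFE_FPE = {fpe:.3f}, MSFE_POOS = {poos:.3f}")
```

Because the simulated series is stationary, the two estimates should be close to each other. A pseudo out-of-sample value far above the FPE value would be the warning sign the text describes.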


Application: Did the predictive power of the term spread change during the 
2000s? Using the QLR statistic, we rejected the null hypothesis that the predictive 
power of the term spread has been stable against the alternative of a break at the 1% 
significance level, with a break occurring in the early 1980s. Does the ADL(2, 2) 
model provide a stable forecasting model subsequent to the 1980:Q4 break?

If the coefficients of the ADL(2, 2) model changed toward the end of the 
1981:Q1-2017:Q3 period, then pseudo out-of-sample forecasts computed using an 
estimation sample starting in 1981:Q1 should deteriorate. The pseudo out-of-sample 
forecasts of the growth rate of GDP for the period 2003:Q1-2017:Q3, computed 
using the estimation sample of 1981:Q1-2002:Q4 and the method of Key Concept 15.7,
are plotted in Figure 15.6, along with the actual values of the growth rate of GDP. The 
pseudo out-of-sample forecast errors are the differences between the actual growth 
rate of GDP and its pseudo out-of-sample forecast —that is, the differences between 
the two lines in Figure 15.6. For example, in 2006:Q4, the growth rate of GDP was 3.1 
percentage points (at an annual rate), but the pseudo out-of-sample forecast of 
$\widehat{GDPGR}_{2006:Q4|2006:Q3}$ was 1.6 percentage points, so the pseudo out-of-sample forecast error
was $GDPGR_{2006:Q4} - \widehat{GDPGR}_{2006:Q4|2006:Q3} = 1.5$ percentage points. In other words,
a forecaster using the ADL(2, 2) estimated through 2006:Q3 would have forecasted 
GDP growth of 1.6 percentage points in 2006:Q4, whereas in reality GDP grew by 
3.1 percentage points. 

How do the mean and standard deviation of the pseudo out-of-sample forecast
errors compare with the in-sample fit of the model? If the forecasting model is
stable, the pseudo out-of-sample forecast errors should have mean 0. However,
over the 2003:Q1-2017:Q4 pseudo out-of-sample forecast period, the average forecast
error is -0.57, and the t-statistic testing the hypothesis that the mean forecast
error equals 0 is -2.00; thus the hypothesis that the forecasts have mean 0 is rejected
at the 5% significance level. That said, $RMSFE_{FPE} = 2.45$ (1981:Q1-2002:Q4) and
$RMSFE_{POOS} = 2.29$ (2003:Q1-2017:Q4), indicating a slight improvement of the
forecast in the out-of-sample period. Figure 15.6 shows that the pseudo out-of-sample
forecasts track actual GDP growth reasonably well except during late 2008
and early 2009, the period of steepest decline of GDP during the financial crisis
and its immediate aftermath. Excluding the single quarter 2008:Q4 lowers
$RMSFE_{POOS}$ from 2.29 to 1.85.

FIGURE 15.6 U.S. GDP Growth Rates and Pseudo Out-of-Sample Forecasts
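
The summary statistics used in this comparison (the mean forecast error, its t-statistic, and the RMSFE) can be computed as in the following Python sketch. The numbers are made up for illustration and are not the GDP data. The standard error assumes serially uncorrelated forecast errors, which is a simplifying assumption of the sketch, and `forecast_error_summary` is a hypothetical helper name.

```python
import numpy as np

def forecast_error_summary(actual, forecast):
    """Mean forecast error, t-statistic for mean zero, and RMSFE.

    The standard error treats the errors as serially uncorrelated,
    which is an assumption of this sketch.
    """
    errors = np.asarray(actual, float) - np.asarray(forecast, float)
    n = errors.size
    mean_error = errors.mean()
    se_mean = errors.std(ddof=1) / np.sqrt(n)
    t_stat = mean_error / se_mean
    rmsfe = np.sqrt(np.mean(errors ** 2))
    return mean_error, t_stat, rmsfe

# Illustrative (made-up) actual values and pseudo out-of-sample forecasts.
actual = [3.1, 2.0, 1.0, 2.5]
forecast = [1.6, 2.5, 1.5, 2.0]
mean_error, t_stat, rmsfe = forecast_error_summary(actual, forecast)
print(f"mean error = {mean_error:.2f}, t = {t_stat:.2f}, RMSFE = {rmsfe:.2f}")
```

A t-statistic that is large in absolute value points to biased forecasts, while the RMSFE can be compared with the in-sample estimate of forecast accuracy.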

According to the pseudo out-of-sample forecasting exercise, the performance of
the ADL(2, 2) forecasting model during the pseudo out-of-sample period
2003:Q1-2017:Q4 was, with the exception of the sharp decline in GDP in late 2008,
better than its performance during the in-sample period of 1981:Q1-2002:Q4.¹¹


Avoiding the Problems Caused by Breaks 


How best to adjust for a break in the population regression function depends on the 
source of that break. If a distinct break occurs at a specific date, that break will be 
detected with high probability by the QLR statistic, and the break date can be esti- 
mated. The regression function can then be reestimated using a binary variable indi- 
cating the two subsamples associated with this break and including interactions with 
the other regressors as appropriate. If all the coefficients break, then this simplifies 
to reestimating the regression using the post-break data. If there is, in fact, a distinct 
break, then subsequent inference on the regression coefficients can proceed as 
usual (for example, using normal critical values for hypothesis tests based on
t-statistics). In addition, forecasts can be produced using the regression function
estimated using the post-break model.

¹¹ The ADL(2, 2) was not alone in failing to forecast GDP growth in 2008:Q4. Researchers at the Federal
Reserve Bank of Philadelphia surveyed 47 professional forecasters in the third quarter of 2008 and asked
for their forecasts of the growth rate of GDP in the fourth quarter. The median of the 47 forecasts was
0.7%, lower than the ADL(2, 2) forecast of 2.0%. The actual growth rate of GDP in 2008:Q4 was -8.5%.

If the break is not distinct but rather arises from a slow, ongoing change in the
parameters, the remedy is more difficult and goes beyond the scope of this book.¹²
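
For the case of a single distinct break discussed above, the following Python sketch illustrates the statement that, when all coefficients break, the regression with a break indicator and a full set of interactions gives the same post-break coefficients as a regression on the post-break data alone. The data are simulated, the break date is treated as known, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
T, tau = 200, 120                      # tau: first post-break observation
x = rng.normal(size=T)
d = (np.arange(T) >= tau).astype(float)

# Pre-break: intercept 1.0, slope 0.5. Post-break: intercept 3.0, slope 0.2.
y = 1.0 + 0.5 * x + d * (2.0 - 0.3 * x) + rng.normal(size=T)

# Regression with the break indicator and its interactions (all coefficients break).
X = np.column_stack([np.ones(T), x])
X_break = np.column_stack([X, X * d[:, None]])
coef, *_ = np.linalg.lstsq(X_break, y, rcond=None)
post_from_interactions = coef[:2] + coef[2:]   # baseline plus post-break shift

# Regression using only the post-break observations.
post_only, *_ = np.linalg.lstsq(X[tau:], y[tau:], rcond=None)

print("post-break coefficients (interactions):", post_from_interactions)
print("post-break coefficients (post-break sample):", post_only)
```

When only some coefficients break, the interaction terms would be included only for those regressors, and the two approaches no longer coincide.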


Conclusion 


In time series data, a variable generally is correlated from one observation, or date, to 
the next. A consequence of this correlation is that linear regression can be used to fore- 
cast future values of a time series based on its current and past values. The starting point 
for time series regression is an autoregression, in which the regressors are lagged values 
of the dependent variable. If additional predictors are available, then their lags can be 
added to the regression. This chapter has described methods for specifying and estimat- 
ing forecasting regressions, for selecting among competing forecasting regressions, for 
handling trends in the data, and for assessing the stability of forecasting models. 

The time series regressions in this chapter were developed for forecasting, and 
in general, the coefficients do not have a causal interpretation. In some applications, 
however, the task is not to develop a forecasting model but rather to estimate causal 
relationships among time series variables—that is, to estimate the dynamic causal 
effect on Y over time of a change in X. Under the right conditions, the methods of 
this chapter, or closely related methods, can be used to estimate dynamic causal 
effects, and that is the topic of the next chapter. 


Summary 


1. Regression models used for forecasting need not have a causal interpretation. 

2. A time series variable generally is correlated with one or more of its lagged 
values; that is, it is serially correlated. 

3. The accuracy of a forecast is measured by its mean squared forecast error. 

4. An autoregression of order p is a linear multiple regression model in which the 
regressors are the first p lags of the dependent variable. The coefficients of an 
AR(p) can be estimated by OLS, and the estimated regression function can 
be used for forecasting. The lag order p can be estimated using an information 
criterion such as the BIC or the AIC. 

5. Adding other variables and their lags to an autoregression can improve fore- 
casting performance. Under the least squares assumptions for prediction with 
time series regression (Key Concept 15.6), the OLS estimators have normal 
distributions in large samples, and statistical inference proceeds the same way 
as for cross-sectional data. 


6. Forecast intervals quantify forecast uncertainty. If the errors are normally
distributed, an approximate 68% forecast interval can be constructed as the
forecast plus or minus an estimate of the root mean squared forecast error.

7. A series that contains a stochastic trend is nonstationary. A random walk
stochastic trend can be detected using the ADF statistic and can be eliminated
by using the first difference of the series.

8. If the population regression function changes over time, then OLS estimates
neglecting this instability produce unreliable forecasts. The QLR statistic can
be used to test for a break, and if a discrete break is found, the regression
function can be reestimated allowing for the break.

9. Pseudo out-of-sample forecasts can be used to estimate the root mean squared
forecast error, to compare different forecasting models, and to assess model
stability toward the end of the sample.

¹² For additional discussion of estimation and testing in the presence of discrete breaks, see Hansen (2001).
For an advanced discussion of estimation and forecasting when there are slowly evolving coefficients, see
Hamilton (1994, Chapter 13).


Key Terms 


gross domestic product (GDP)
first difference
first lag
jth lag
autocorrelation
serial correlation
autocorrelation coefficient
jth autocovariance
stationarity
nonstationarity
one-step ahead forecast
multi-step ahead forecast
forecast error
mean squared forecast error (MSFE)
root mean squared forecast error (RMSFE)
oracle forecast
autoregression
first-order autoregression
pth-order autoregressive [AR(p)] model
term spread
autoregressive distributed lag (ADL) model
ADL(p, q)
weak dependence
final prediction error (FPE)
pseudo out-of-sample forecasting
forecast interval
fan chart
Bayes information criterion (BIC)
Akaike information criterion (AIC)
trend
deterministic trend
stochastic trend
random walk
random walk with drift
unit root
spurious regression
Dickey-Fuller statistic
augmented Dickey-Fuller (ADF) statistic
break
break date
Quandt likelihood ratio (QLR) statistic
lag operator
lag polynomial
For additional Empirical Exercises and Data Sets, log on to the Companion Website at
www.pearsonglobaleditions.com.



Review the Concepts 


15.1 Look at the four plots in Figure 15.2: U.S. unemployment rate, U.S. dollar/British
pound exchange rate, logarithm of Japan index of industrial production,
and the percentage change in daily values. Which of these series appears
to be nonstationary? Which of them appears to resemble a random walk?

15.2 Many financial economists believe that the random walk model is a good
description of the logarithm of stock prices. It implies that the percentage
changes in stock prices are unforecastable. A financial analyst claims to have
a new model that makes better predictions than the random walk model.
Explain how you would examine the analyst’s claim that his model is superior.

15.3 A researcher estimates an AR(1) with an intercept and finds that the OLS
estimate of $\beta_1$ is 0.88, with a standard error of 0.03. Does a 95% confidence
interval include $\beta_1 = 1$? Explain.

15.4 Suppose you suspected that the intercept in Equation (15.15) changed in
1992:Q1. How would you modify the equation to incorporate this change?
How would you test for a change in the intercept? How would you test for a
change in the intercept if you did not know the date of the change?


Exercises 


15.1 Consider the AR(1) model $Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$. Suppose the process is
stationary.

a. Show that $E(Y_t) = E(Y_{t-1})$. (Hint: Read Key Concept 15.3.)
b. Show that $E(Y_t) = \beta_0/(1 - \beta_1)$.

15.2 The Index of Industrial Production ($IP_t$) is a monthly time series that measures
the quantity of industrial commodities produced in a given month. This problem
uses data on this index for the United States. All regressions are estimated
over the sample period 1986:M1-2017:M12 (that is, January 1986 through
December 2017). Let $Y_t = 1200 \times \ln(IP_t/IP_{t-1})$.

a. A forecaster states that $Y_t$ shows the monthly percentage change in IP,
measured in percentage points per annum. Is this correct? Why?

b. Suppose she estimates the following AR(4) model for $Y_t$:

$$\hat{Y}_t = \underset{(0.488)}{0.749} + \underset{(0.088)}{0.071}\,Y_{t-1} + \underset{(0.053)}{0.170}\,Y_{t-2} + \underset{(0.078)}{0.216}\,Y_{t-3} + \underset{(0.064)}{0.167}\,Y_{t-4}.$$

Use this AR(4) to forecast the value of $Y_t$ in January 2018, using the following
values of IP for July 2017 through December 2017:

Date   2017:M7   2017:M8   2017:M9   2017:M10   2017:M11   2017:M12
IP     105.01    104.56    104.82    106.58     106.86     107.30

c. Worried about potential seasonal fluctuations in production, she adds
$Y_{t-12}$ to the autoregression. The estimated coefficient on $Y_{t-12}$ is -0.061,
with a standard error of 0.043. Is this coefficient statistically significant?

d. Worried about a potential break, she computes a QLR test (with 15%
trimming) on the constant and AR coefficients in the AR(4) model. The
resulting QLR statistic is 1.80. Is there evidence of a break? Explain.

e. Worried that she might have included too few or too many lags in the
model, the forecaster estimates AR(p) models for p = 0, 1, ..., 6 over
the same sample period. The sum of squared residuals from each of
these estimated models is shown in the table. Use the BIC to estimate
the number of lags that should be included in the autoregression. Do the
results differ if you use the AIC?

AR Order   0        1        2        3        4        5        6
SSR        21,045   20,043   18,870   17,838   17,344   17,337   17,306

15.3 Using the same data as in Exercise 15.2, a researcher tests for a stochastic
trend in $\ln(IP_t)$, using the following regression:

$$\widehat{\Delta \ln(IP_t)} = \underset{(0.013)}{0.026} + \underset{(0.000067)}{0.000097}\,t - \underset{(0.0037)}{0.0070}\,\ln(IP_{t-1}) + \underset{(0.050)}{0.068}\,\Delta\ln(IP_{t-1}) + \underset{(0.049)}{0.169}\,\Delta\ln(IP_{t-2}) + \underset{(0.050)}{0.219}\,\Delta\ln(IP_{t-3}) + \underset{(0.051)}{0.173}\,\Delta\ln(IP_{t-4}),$$

where the standard errors shown in parentheses are computed using the
homoskedasticity-only formula and the regressor t is a linear time trend.

a. Use the ADF statistic to test for a stochastic trend (unit root) in $\ln(IP)$.

b. Do these results support the specification used in Exercise 15.2?
Explain.
15.4 The forecaster in Exercise 15.2 augments her AR(4) model for IP growth to
include four lagged values of $\Delta R_t$, where $R_t$ is the interest rate on three-month
U.S. Treasury bills (measured in percentage points at an annual rate).

a. The F-statistic on the four lags of $\Delta R_t$ is 3.91. Do interest rates help
predict IP growth? Explain.

b. The researcher also regresses $\Delta R_t$ on a constant, four lags of $\Delta R_t$, and
four lags of IP growth. The resulting F-statistic on the four lags of IP
growth is 1.48. Does IP growth help to predict interest rates? Explain.

15.5 Prove the following results about conditional means, forecasts, and forecast
errors:

a. Let W be a random variable with mean $\mu_W$ and variance $\sigma_W^2$, and let c be
a constant. Show that $E[(W - c)^2] = \sigma_W^2 + (\mu_W - c)^2$.

b. Consider the problem of forecasting $Y_t$ using data on $Y_{t-1}, Y_{t-2}, \ldots$.
Let $f_{t-1}$ denote some forecast of $Y_t$, where the subscript t - 1 on $f_{t-1}$
indicates that the forecast is a function of data through date t - 1.
Let $E[(Y_t - f_{t-1})^2 \mid Y_{t-1}, Y_{t-2}, \ldots]$ be the conditional mean squared
error of the forecast $f_{t-1}$, conditional on values of Y observed through
date t - 1. Show that the conditional mean squared forecast error is
minimized when $f_{t-1} = Y_{t|t-1}$, where $Y_{t|t-1} = E(Y_t \mid Y_{t-1}, Y_{t-2}, \ldots)$.
(Hint: Review Appendix 2.2.)

c. Let $u_t$ denote the error in Equation (15.12). Show that $cov(u_t, u_{t-j}) = 0$
for $j \neq 0$. [Hint: Use Equation (2.28).]

15.6 In this exercise, you will conduct a Monte Carlo experiment to study the
phenomenon of spurious regression discussed in Section 15.7. In a Monte Carlo
study, artificial data are generated using a computer, and then those artificial
data are used to calculate the statistics being studied. This makes it possible
to compute the distribution of statistics for known models when mathematical
expressions for those distributions are complicated (as they are here) or
even unknown. In this exercise, you will generate data so that two series, $Y_t$
and $X_t$, are independently distributed random walks. The specific steps are as
follows:

i. Use your computer to generate a sequence of T = 100 i.i.d. standard normal
random variables. Call these variables $e_1, e_2, \ldots, e_{100}$. Set $Y_1 = e_1$ and
$Y_t = Y_{t-1} + e_t$ for t = 2, 3, ..., 100.

ii. Use your computer to generate a new sequence, $a_1, a_2, \ldots, a_{100}$, of T = 100
i.i.d. standard normal random variables. Set $X_1 = a_1$ and $X_t = X_{t-1} + a_t$
for t = 2, 3, ..., 100.

iii. Regress $Y_t$ onto a constant and $X_t$. Compute the OLS estimator, the regression
$R^2$, and the (homoskedasticity-only) t-statistic testing the null hypothesis
that $\beta_1$ (the coefficient on $X_t$) is 0.


Use this algorithm to answer the following questions:

a. Run the algorithm (i) through (iii) once. Use the t-statistic from (iii) to
test the null hypothesis that $\beta_1 = 0$, using the usual 5% critical value of
1.96. What is the $R^2$ of your regression?

b. Repeat (a) 1000 times, saving each value of $R^2$ and the t-statistic. Construct
a histogram of the $R^2$ and t-statistic. What are the 5%, 50%, and
95% percentiles of the distributions of the $R^2$ and the t-statistic? In what
fraction of your 1000 simulated data sets does the t-statistic exceed 1.96
in absolute value?

c. Repeat (b) for different numbers of observations, such as T = 50 and
T = 200. As the sample size increases, does the fraction of times that
you reject the null hypothesis approach 5%, as it should because you
have generated Y and X to be independently distributed? Does this
fraction seem to approach some other limit as T gets large? What is that
limit?

15.7 Suppose $Y_t$ follows the stationary AR(1) model $Y_t = 2.5 + 0.7 Y_{t-1} + u_t$,
where $u_t$ is i.i.d. with $E(u_t) = 0$ and $var(u_t) = 9$.

a. Compute the mean and variance of $Y_t$. (Hint: See Exercise 15.1.)
b. Compute the first two autocovariances of $Y_t$. (Hint: Read Appendix 15.2.)
c. Compute the first two autocorrelations of $Y_t$.
d. Suppose $Y_T = 102.3$. Compute $Y_{T+1|T} = E(Y_{T+1} \mid Y_T, Y_{T-1}, \ldots)$.

15.8 Suppose $Y_t$ is the monthly value of the number of new home construction
projects started in the United States. Because of the weather, $Y_t$ has a pronounced
seasonal pattern; for example, housing starts are low in January
and high in June. Let $\mu_{Jan}$ denote the average value of housing starts in
January, and let $\mu_{Feb}, \mu_{Mar}, \ldots, \mu_{Dec}$ denote the average values in the other
months. Show that the values of $\mu_{Jan}, \mu_{Feb}, \ldots, \mu_{Dec}$ can be estimated from
the OLS regression $Y_t = \beta_0 + \beta_1 Feb_t + \beta_2 Mar_t + \cdots + \beta_{11} Dec_t + u_t$, where
$Feb_t$ is a binary variable equal to 1 if t is February, $Mar_t$ is a binary variable
equal to 1 if t is March, and so forth. (Hint: Show that $\beta_0 + \beta_2 = \mu_{Mar}$ and
so forth.)

15.9 The moving average model of order q has the form

$$Y_t = \beta_0 + e_t + b_1 e_{t-1} + b_2 e_{t-2} + \cdots + b_q e_{t-q},$$

where $e_t$ is a serially uncorrelated random variable with mean 0 and variance $\sigma_e^2$.

a. Show that $E(Y_t) = \beta_0$.
b. Show that the variance of $Y_t$ is $var(Y_t) = \sigma_e^2 (1 + b_1^2 + b_2^2 + \cdots + b_q^2)$.
c. Show that $\rho_j = 0$ for $j > q$.
d. Suppose q = 1. Derive the autocovariances for Y.
15.10 A researcher carries out a QLR test using 30% trimming, and there are q = 5
restrictions. Answer the following questions, using the values in Table 15.5
(“Critical Values of the QLR Statistic with 15% Trimming”) and Appendix
Table 4 (“Critical Values for the $F_{m,\infty}$ Distribution”).

a. The QLR F-statistic is 3.9. Should the researcher reject the null
hypothesis at the 5% level?

b. The QLR F-statistic is 1.1. Should the researcher reject the null
hypothesis at the 5% level?

c. The QLR F-statistic is 3.6. Should the researcher reject the null
hypothesis at the 5% level?

15.11 Suppose $\Delta Y_t$ follows the AR(1) model $\Delta Y_t = \beta_0 + \beta_1 \Delta Y_{t-1} + u_t$.

a. Show that $Y_t$ follows an AR(2) model.
b. Derive the AR(2) coefficients for $Y_t$ as a function of $\beta_0$ and $\beta_1$.

15.12 Consider the stationary AR(1) model $Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$, where $u_t$ is i.i.d.
with mean 0 and variance $\sigma_u^2$. The model is estimated using data from time
periods t = 1 through t = T, yielding the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$. You are
interested in forecasting the value of Y at time T + 1, that is, $Y_{T+1}$. Denote
the forecast by $\hat{Y}_{T+1|T} = \hat{\beta}_0 + \hat{\beta}_1 Y_T$.

a. Show that the forecast error is
$Y_{T+1} - \hat{Y}_{T+1|T} = u_{T+1} - [(\hat{\beta}_0 - \beta_0) + (\hat{\beta}_1 - \beta_1) Y_T]$.
b. Show that $u_{T+1}$ is independent of $Y_T$.
c. Show that $u_{T+1}$ is independent of $\hat{\beta}_0$ and $\hat{\beta}_1$.
d. Show that $var(Y_{T+1} - \hat{Y}_{T+1|T}) = \sigma_u^2 + var[(\hat{\beta}_0 - \beta_0) + (\hat{\beta}_1 - \beta_1) Y_T]$.

15.13 Suppose $Y_t$ follows a random walk, $Y_t = Y_{t-1} + u_t$, for t = 1, ..., T, where
$Y_0 = 0$ and $u_t$ is i.i.d. with mean 0 and variance $\sigma_u^2$.

a. Compute the mean and variance of $Y_t$.
b. Compute the covariance between $Y_t$ and $Y_{t-k}$.
c. Use the results in (a) and (b) to show that $Y_t$ is nonstationary.


Empirical Exercises 


E15.1 On the text website, http://www.pearsonglobaleditions.com, you will find the
data file USMacro_Quarterly, which contains quarterly data on several macroeconomic
series for the United States; the data are described in the file
USMacro_Description. The variable PCEP is the price index for personal consumption
expenditures from the U.S. National Income and Product Accounts.
In this exercise, you will construct forecasting models for the rate of inflation
based on PCEP. For this analysis, use the sample period 1963:Q1-2017:Q4
(where data before 1963 may be used, as necessary, as initial values for lags in
regressions).

a. i. Compute the inflation rate, $Infl_t = 400 \times [\ln(PCEP_t) - \ln(PCEP_{t-1})]$.
What are the units of Infl? (Is Infl measured in dollars, percentage
points, percentage per quarter, percentage per year, or something
else? Explain.)

ii. Plot the value of Infl from 1963:Q1 through 2017:Q4. Based on the
plot, do you think that Infl has a stochastic trend? Explain.

b. i. Compute the first four autocorrelations of $\Delta Infl$.

ii. Plot the value of $\Delta Infl$ from 1963:Q1 through 2017:Q4. The plot
should look choppy or jagged. Explain why this behavior is consistent
with the first autocorrelation that you computed in (i).

c. i. Run an OLS regression of $\Delta Infl_t$ on $\Delta Infl_{t-1}$. Does knowing the
change in inflation over the current quarter help predict the change
in inflation over the next quarter? Explain.

ii. Estimate an AR(2) model for $\Delta Infl$. Is the AR(2) model better than
an AR(1) model? Explain.

iii. Estimate an AR(p) model for p = 0, ..., 8. What lag length is chosen
by the BIC? What lag length is chosen by the AIC?

iv. Use the AR(2) model to predict the change in inflation from 2017:Q4
to 2018:Q1, that is, to predict the value of $\Delta Infl_{2018:Q1}$.

v. Use the AR(2) model to predict the level of the inflation rate in
2018:Q1, that is, $Infl_{2018:Q1}$.

d. i. Use the ADF test for the regression in Equation (15.32) with two lags
of $\Delta Infl$ (so that p = 3 in Equation (15.32)) to test for a stochastic
trend in Infl.

ii. Is the ADF test based on Equation (15.32) preferred to the test based
on Equation (15.33) for testing for a stochastic trend in Infl? Explain.

iii. In (i), you used two lags of $\Delta Infl$. Should you use more lags? Fewer
lags? Explain.

iv. Based on the test you carried out in (i), does the AR model for Infl
contain a unit root? Explain carefully. (Hint: Does the failure to
reject a null hypothesis mean that the null hypothesis is true?)

e. Use the QLR test with 15% trimming to test the stability of the coefficients
in the AR(2) model for $\Delta Infl$. Is the AR(2) model stable? Explain.

f. i. Using the AR(2) model for $\Delta Infl$ with a sample period that begins in
1963:Q1, compute pseudo out-of-sample forecasts for the change in inflation
beginning in 2003:Q1 and going through 2017:Q4. (That is, compute
$\widehat{\Delta Infl}_{2003:Q1|2002:Q4}, \widehat{\Delta Infl}_{2003:Q2|2003:Q1}, \ldots, \widehat{\Delta Infl}_{2017:Q4|2017:Q3}$.)

ii. Are the pseudo out-of-sample forecasts biased? That is, do the forecast
errors have a nonzero mean?


iii. How large is the RMSFE of the pseudo out-of-sample forecasts? Is
this consistent with the AR(2) model for $\Delta Infl$ estimated over the
1963:Q1-2002:Q4 sample period?


iv. There is a large outlier in 2008:Q4. Why did inflation fall so much in 
2008:Q4? (Hint: Collect some data on oil prices. What happened to oil 
prices during 2008?) 


E15.2 Read the box “Can You Beat the Market?” Next go to the course website, 
where you will find an extended version of the data set described in the box; 
the data are in the file Stock_Returns_1931_2002 and are described in the file 
Stock_Returns_1931_2002_Description. 


a. Repeat the calculations reported in Table 15.2 using regressions
estimated over the 1932:M1-2002:M12 sample period.


b. Construct pseudo out-of-sample forecasts of excess returns over the 
1983:M1-2002:M12 period using regressions that begin in 1932:M1. 


c. Do the results in (a)-(b) suggest any important changes to the conclu- 
sions reached in the box? Explain. 


Time Series Data Used in Chapter 15 


Macroeconomic time series data for the United States are collected and published by various 
government agencies. The Bureau of Economic Analysis in the Department of Commerce 
publishes the National Income and Product Accounts, which include the GDP data used in this 
chapter. The unemployment rate is computed from the Bureau of Labor Statistics’ Current 
Population Survey (see Appendix 3.1). The quarterly data used here were computed by aver- 
aging the monthly values. The 10-year Treasury bond rate, 3-month Treasury bill rate, and the 
$/£ exchange rate data are quarterly averages of daily rates, as reported by the Federal Reserve 
System. The Japan Index of Industrial Production is published by the Organisation for Eco- 
nomic Co-operation and Development (OECD). The daily percentage change in the Wilshire 
5000 Total Market Index, a stock price index, was computed as 100AIn(W5000,), where W5000, 
is the daily value of the index; because the stock exchange is not open on weekends and holi- 
days, the time period of analysis is a business day. We obtained all these data series from the 
Federal Reserve Economic Data (FRED) website at the Federal Reserve Bank of St. Louis. 
There you can find time series data on thousands of macroeconomic variables.

The regressions in Table 15.2 use monthly financial data for the United States. Stock
prices ($P_t$) are measured by the broad-based (NYSE and AMEX), value-weighted index of
stock prices constructed by the Center for Research in Security Prices (CRSP). The monthly
percentage excess return is $100 \times \{\ln[(P_t + Div_t)/P_{t-1}] - \ln(TBill_t)\}$, where $Div_t$ is the
dividends paid on the stocks in the CRSP index and $TBill_t$ is the gross return (1 plus the
interest rate) on a 30-day Treasury bill during month t. We thank Motohiro Yogo for providing both
his help and these data.


Stationarity in the AR(1) Model 


This appendix shows that if | 6, | < 1 and u; is stationary, then Y, is stationary. Recall from 
Key Concept 15.3 that the time series variable Y, is stationary if the joint distribution of 
(Yji1,---, Y+ r) does not depend on s, regardless of the value of T. To streamline the argu- 
ment, we show this for T = 2 under the simplifying assumptions that By) = 0 and {u,} are iid. 
N(0, 02). 

The first step is deriving an expression for Y, in terms of the ups. Because By = 0, Equa- 
tion (15.8) implies that Y, = 6, Y,-, + u, Substituting Y,_; = B,Y,;-2 + u;,—_ into this expres- 


sion yields Y, = B;(B:Y;-2 + u;-1) + u, = BY,- + By, + u, Continuing this substitution 


another step yields Y, = Bi Y;_3 + Biu,-2 + Byu,-1 + up and continuing indefinitely yields 


Y, = u, + Buty + Bura + Buz +--+ = > piui (15.37) 


Thus $Y_t$ is a weighted average of current and past $u_t$'s. Because the $u_t$'s are normally distributed
and because the weighted average of normal random variables is normal (Section 2.4), $Y_{s+1}$ and
$Y_{s+2}$ have a bivariate normal distribution. Recall from Section 2.4 that the bivariate normal
distribution is completely determined by the means of the two variables, their variances, and
their covariance. Thus, to show that $Y_t$ is stationary, we need to show that the means, variances,
and covariance of $(Y_{s+1}, Y_{s+2})$ do not depend on $s$. An extension of the argument used below
can be used to show that the distribution of $(Y_{s+1}, Y_{s+2}, \ldots, Y_{s+T})$ does not depend on $s$.

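The means and variances of $Y_{s+1}$ and $Y_{s+2}$ can be computed using Equation (15.37),
with the subscript $s + 1$ or $s + 2$ replacing $t$. First, because $E(u_t) = 0$ for all $t$,
$E(Y_t) = E\left(\sum_{i=0}^{\infty} \beta_1^i u_{t-i}\right) = \sum_{i=0}^{\infty} \beta_1^i E(u_{t-i}) = 0$, so the means of $Y_{s+1}$ and $Y_{s+2}$ are both
0 and in particular do not depend on $s$. Second, because the $u_t$'s are independent,
$\mathrm{var}(Y_t) = \mathrm{var}\left(\sum_{i=0}^{\infty} \beta_1^i u_{t-i}\right) = \sum_{i=0}^{\infty} (\beta_1^i)^2 \mathrm{var}(u_{t-i}) = \sigma_u^2 \sum_{i=0}^{\infty} (\beta_1^i)^2 = \sigma_u^2/(1 - \beta_1^2)$,
where the final equality follows from the fact that if $|a| < 1$, $\sum_{i=0}^{\infty} a^i = 1/(1 - a)$ (applied here
with $a = \beta_1^2$); thus $\mathrm{var}(Y_{s+1}) = \mathrm{var}(Y_{s+2}) = \sigma_u^2/(1 - \beta_1^2)$. Finally,
because $Y_{s+2} = \beta_1 Y_{s+1} + u_{s+2}$, $\mathrm{cov}(Y_{s+1}, Y_{s+2}) = E(Y_{s+1} Y_{s+2}) = E[Y_{s+1}(\beta_1 Y_{s+1} + u_{s+2})] =
\beta_1 \mathrm{var}(Y_{s+1}) + \mathrm{cov}(Y_{s+1}, u_{s+2}) = \beta_1 \mathrm{var}(Y_{s+1}) = \beta_1 \sigma_u^2/(1 - \beta_1^2)$.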

The covariance does not depend on $s$, so $Y_{s+1}$ and $Y_{s+2}$ have a joint probability
distribution that does not depend on $s$; that is, their joint distribution is stationary. If $|\beta_1| \geq 1$, this
calculation breaks down because the infinite sum in Equation (15.37) does not converge, and
the variance of $Y_t$ is infinite. Thus $Y_t$ is stationary if $|\beta_1| < 1$ but not if $|\beta_1| \geq 1$.

The preceding argument was made under the assumptions that $\beta_0 = 0$ and $u_t$ is normally
distributed. If $\beta_0 \neq 0$, the argument is similar except that the means of $Y_{s+1}$ and $Y_{s+2}$ are
$\beta_0/(1 - \beta_1)$ and Equation (15.37) must be modified for this nonzero mean. The assumption
that $u_t$ is i.i.d. normal can be replaced with the assumption that $u_t$ is stationary with a finite
variance because, by Equation (15.37), $Y_t$ can still be expressed as a function of current and
past $u_t$'s, so the distribution of $Y_t$ is stationary as long as the distribution of $u_t$ is stationary and
the infinite sum expression in Equation (15.37) is meaningful in the sense that it converges,
which requires that $|\beta_1| < 1$.
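The variance and covariance formulas derived in this appendix can be checked numerically. The following is a minimal simulation sketch, not part of the original text: the parameter values (an AR(1) coefficient of 0.7 and an error standard deviation of 1), the burn-in length, and the function name `simulate_ar1` are all illustrative assumptions.

```python
import numpy as np

def simulate_ar1(beta1, sigma_u, n_obs, burn_in=500, seed=0):
    """Simulate Y_t = beta1 * Y_{t-1} + u_t with u_t i.i.d. N(0, sigma_u^2)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, sigma_u, size=n_obs + burn_in)
    y = np.zeros(n_obs + burn_in)
    for t in range(1, n_obs + burn_in):
        y[t] = beta1 * y[t - 1] + u[t]
    # Drop the start-up period so the arbitrary starting value 0 does not matter.
    return y[burn_in:]

beta1, sigma_u = 0.7, 1.0  # hypothetical values with |beta1| < 1
y = simulate_ar1(beta1, sigma_u, n_obs=200_000)

# Formulas from the appendix: var(Y_t) and cov(Y_{s+1}, Y_{s+2}).
theory_var = sigma_u**2 / (1 - beta1**2)
theory_cov = beta1 * theory_var

sample_var = y.var()
sample_cov = np.mean((y[1:] - y.mean()) * (y[:-1] - y.mean()))

print(f"variance:   sample {sample_var:.3f}, formula {theory_var:.3f}")
print(f"covariance: sample {sample_cov:.3f}, formula {theory_cov:.3f}")
```

With a long simulated series, the sample moments should be close to the formula values, whichever stretch of the series is used. That is the practical meaning of the moments not depending on $s$.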


Lag Operator Notation 


The notation in this and the next two chapters is streamlined considerably by adopting what
is known as lag operator notation. Let $L$ denote the lag operator, which has the property that
it transforms a variable into its lag. That is, the lag operator $L$ has the property $LY_t = Y_{t-1}$. By
applying the lag operator twice, one obtains the second lag: $L^2 Y_t = L(LY_t) = LY_{t-1} = Y_{t-2}$.
More generally, by applying the lag operator $j$ times, one obtains the $j$th lag. In summary, the
lag operator has the property that

$$LY_t = Y_{t-1}, \quad L^2 Y_t = Y_{t-2}, \quad \text{and} \quad L^j Y_t = Y_{t-j}. \tag{15.38}$$


The lag operator notation permits us to define the lag polynomial, which is a polynomial in
the lag operator:

$$a(L) = a_0 + a_1 L + a_2 L^2 + \cdots + a_p L^p = \sum_{j=0}^{p} a_j L^j, \tag{15.39}$$


where $a_0, \ldots, a_p$ are the coefficients of the lag polynomial and $L^0 = 1$. The degree of the lag
polynomial $a(L)$ in Equation (15.39) is $p$. Multiplying $Y_t$ by $a(L)$ yields

$$a(L)Y_t = \Big(\sum_{j=0}^{p} a_j L^j\Big) Y_t = \sum_{j=0}^{p} a_j (L^j Y_t) = \sum_{j=0}^{p} a_j Y_{t-j} = a_0 Y_t + a_1 Y_{t-1} + \cdots + a_p Y_{t-p}. \tag{15.40}$$


The expression in Equation (15.40) implies that the AR($p$) model in Equation (15.12) can be
written compactly as

$$a(L)Y_t = \beta_0 + u_t, \tag{15.41}$$

where $a_0 = 1$ and $a_j = -\beta_j$ for $j = 1, \ldots, p$. Similarly, an ADL($p$, $q$) model can be written

$$a(L)Y_t = \beta_0 + c(L)X_{t-1} + u_t, \tag{15.42}$$

where $a(L)$ is a lag polynomial of degree $p$ with $a_0 = 1$ and $c(L)$ is a lag polynomial of
degree $q - 1$.
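To make Equation (15.40) concrete, the sketch below applies a lag polynomial to a short series. It is an illustration only: the function name `apply_lag_polynomial`, the sample series, and the AR(2) coefficients 0.5 and 0.3 are assumptions made for this example, not values from the text.

```python
import numpy as np

def apply_lag_polynomial(coeffs, y):
    """Return a(L)Y_t = sum_j coeffs[j] * Y_{t-j} for each t that has all p lags available."""
    y = np.asarray(y, dtype=float)
    p = len(coeffs) - 1          # degree of the lag polynomial
    n = len(y)
    out = np.zeros(n - p)
    for j, a_j in enumerate(coeffs):
        # y[p - j : n - j] holds Y_{t-j} for t = p, ..., n - 1
        out += a_j * y[p - j : n - j]
    return out

y = np.array([1.0, 2.0, 4.0, 7.0, 11.0])

# a(L) = 1 - L gives the first difference Y_t - Y_{t-1}.
first_difference = apply_lag_polynomial([1.0, -1.0], y)

# a(L) = 1 - 0.5 L - 0.3 L^2, the left-hand side of an AR(2) written as in
# Equation (15.41) with hypothetical beta_1 = 0.5 and beta_2 = 0.3.
ar2_lhs = apply_lag_polynomial([1.0, -0.5, -0.3], y)

print(first_difference)
print(ar2_lhs)
```

For this series the first difference is 1, 2, 3, 4, and the first AR(2) value is $4 - 0.5 \times 2 - 0.3 \times 1 = 2.7$.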
ARMA Models


The autoregressive-moving average (ARMA) model extends the autoregressive model by
modeling $u_t$ as serially correlated, specifically as a distributed lag (or moving average)
of another unobserved error term. In the lag operator notation of Appendix 15.3, let
$u_t = b(L)e_t$, where $b(L)$ is a lag polynomial of degree $q$ with $b_0 = 1$ and $e_t$ is a serially
uncorrelated, unobserved random variable. Then the ARMA($p$, $q$) model is

$$a(L)Y_t = \beta_0 + b(L)e_t, \tag{15.43}$$

where $a(L)$ is a lag polynomial of degree $p$ with $a_0 = 1$.

Both the AR and ARMA models can be thought of as ways to approximate the autocovariances
of $Y_t$. The reason for this is that any stationary time series $Y_t$ with a finite variance
can be written either as an AR or as a MA with a serially uncorrelated error term, although 
the AR or MA model might need to have an infinite order. The second of these results, that a 
stationary process can be written in moving average form, is known as the Wold decomposition 
theorem and is one of the fundamental results underlying the theory of stationary time series 
analysis. 

The families of AR, MA, and ARMA models are equally rich as long as the lag polynomi- 
als have a sufficiently high degree. In some cases, the autocovariances can be better approxi- 
mated by an ARMA (p, q) model with small p and q than by a pure AR model with only a few 
lags. That said, ARMA models are more difficult to extend to additional regressors than are 
AR models. 


Consistency of the BIC Lag Length Estimator 


This appendix summarizes the argument that the BIC estimator of the lag length, $\hat{p}$, in an
autoregression is correct in large samples; that is, $\Pr(\hat{p} = p) \to 1$. This is not true for the AIC
estimator, which can overestimate $p$ even in large samples.


BIC 


First consider the special case in which the BIC is used to choose among autoregressions
with zero, one, or two lags, when the true lag length is one. It is shown below that
(i) $\Pr(\hat{p} = 0) \to 0$ and (ii) $\Pr(\hat{p} = 2) \to 0$, from which it follows that $\Pr(\hat{p} = 1) \to 1$. The
extension of this argument to the general case of searching over $0 \leq p \leq p_{max}$ entails
showing that $\Pr(\hat{p} < p) \to 0$ and $\Pr(\hat{p} > p) \to 0$; the strategy for showing these is the same as
used in (i) and (ii) below.
Proof of (i) and (ii)


Proof of (i). To choose $\hat{p} = 0$, it must be the case that $BIC(0) < BIC(1)$; that
is, $BIC(0) - BIC(1) < 0$. Now $BIC(0) - BIC(1) = [\ln(SSR(0)/T) + (\ln T)/T] -
[\ln(SSR(1)/T) + 2(\ln T)/T] = \ln(SSR(0)/T) - \ln(SSR(1)/T) - (\ln T)/T$. Now
$SSR(0)/T = [(T - 1)/T]s_Y^2 \xrightarrow{p} \sigma_Y^2$, $SSR(1)/T \xrightarrow{p} \sigma_u^2$, and $(\ln T)/T \to 0$; putting
these pieces together, $BIC(0) - BIC(1) \xrightarrow{p} \ln \sigma_Y^2 - \ln \sigma_u^2 > 0$ because $\sigma_Y^2 > \sigma_u^2$. It
follows that $\Pr[BIC(0) < BIC(1)] \to 0$, so $\Pr(\hat{p} = 0) \to 0$.


Proof of (ii). To choose $\hat{p} = 2$, it must be the case that $BIC(2) < BIC(1)$ or
$BIC(2) - BIC(1) < 0$. Now $T[BIC(2) - BIC(1)] = T\{[\ln(SSR(2)/T) + 3(\ln T)/T]
- [\ln(SSR(1)/T) + 2(\ln T)/T]\} = T\ln[SSR(2)/SSR(1)] + \ln T = -T\ln[1 + F/(T - 2)]
+ \ln T$, where $F = [SSR(1) - SSR(2)]/[SSR(2)/(T - 2)]$ is the homoskedasticity-only
$F$-statistic [Equation (7.13)] testing the null hypothesis that $\beta_2 = 0$ in the AR(2). If $u_t$ is
homoskedastic, then $F$ has a $\chi_1^2$ asymptotic distribution; if not, it has some other
asymptotic distribution. Thus $\Pr[BIC(2) - BIC(1) < 0] = \Pr\{T[BIC(2) - BIC(1)] < 0\} =
\Pr\{-T\ln[1 + F/(T - 2)] + (\ln T) < 0\} = \Pr\{T\ln[1 + F/(T - 2)] > \ln T\}$. As $T$
increases, $T\ln[1 + F/(T - 2)] - F \xrightarrow{p} 0$ [a consequence of the logarithmic
approximation $\ln(1 + a) \cong a$, which becomes exact as $a \to 0$]. Thus $\Pr[BIC(2) -
BIC(1) < 0] \to \Pr(F > \ln T) \to 0$, so $\Pr(\hat{p} = 2) \to 0$.


AIC 


In the special case of an AR(1) when zero, one, or two lags are considered, the proof of (i) for
the BIC applies to the AIC where the term $\ln T$ is replaced by 2, so $\Pr(\hat{p} = 0) \to 0$. All the
steps in the proof of (ii) for the BIC also apply to the AIC, with the modification that $\ln T$ is
replaced by 2; thus $\Pr[AIC(2) - AIC(1) < 0] \to \Pr(F > 2) > 0$. If $u_t$ is homoskedastic,
then $\Pr(F > 2) \to \Pr(\chi_1^2 > 2) = 0.16$, so $\Pr(\hat{p} = 2) \to 0.16$. In general, when $\hat{p}$ is
chosen using the AIC, $\Pr(\hat{p} < p) \to 0$, but $\Pr(\hat{p} > p)$ tends to a positive number, so
$\Pr(\hat{p} = p)$ does not tend to 1.
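The contrast between the two criteria can be seen by evaluating the limiting probabilities directly. The sketch below assumes the homoskedastic case, so that $F$ is asymptotically $\chi_1^2$, and uses the identity $\Pr(\chi_1^2 > x) = \Pr(|Z| > \sqrt{x}) = \mathrm{erfc}(\sqrt{x/2})$ for a standard normal $Z$. The helper name `chi2_1_tail` and the sample sizes are illustrative choices.

```python
import math

def chi2_1_tail(x):
    """Pr(chi-squared with 1 degree of freedom > x), via Pr(Z^2 > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# AIC: the threshold stays fixed at 2, so the overfitting probability does not vanish.
aic_limit = chi2_1_tail(2.0)
print(f"AIC: Pr(chi2_1 > 2) = {aic_limit:.3f}")

# BIC: the threshold ln T grows with the sample size, so the probability shrinks toward 0.
for T in (100, 1_000, 10_000, 1_000_000):
    print(f"BIC, T = {T:>9,}: Pr(chi2_1 > ln T) = {chi2_1_tail(math.log(T)):.4f}")
```

The AIC value rounds to the 0.16 reported above, while the BIC probabilities decline as $T$ grows. The decline is slow because $\ln T$ increases slowly.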


16 Estimation of Dynamic Causal Effects


In the 1983 movie Trading Places, the characters played by Dan Aykroyd and Eddie
Murphy used inside information on how well Florida oranges had fared over the
winter to make millions in the orange juice concentrate futures market, a market for 
contracts to buy or sell large quantities of orange juice concentrate at a specified price 
on a future date. In real life, traders in orange juice futures, in fact, do pay close attention 
to the weather in Florida: Freezes in Florida kill Florida oranges, the source of almost all 
frozen orange juice concentrate made in the United States, so its supply falls and the price 
rises. But precisely how much does the price rise when the weather in Florida turns sour? 
Does the price rise all at once, or are there delays; if so, for how long? These are ques- 
tions that real-life traders in orange juice futures need to answer if they want to succeed. 

This chapter takes up the problem of estimating the effect on Y now and in the 
future of a change in X—that is, the dynamic causal effect on Y of a change in X. 
What, for example, is the effect on the path of orange juice prices over time of a freez- 
ing spell in Florida? The starting point for modeling and estimating dynamic causal 
effects is the so-called distributed lag regression model, in which $Y_t$ is expressed as a
function of current and past values of $X_t$. Section 16.1 introduces the distributed lag
model in the context of estimating the effect of cold weather in Florida on the price of 
orange juice concentrate over time. Section 16.2 takes a closer look at what, precisely, 
is meant by a dynamic causal effect. 

One way to estimate dynamic causal effects is to estimate the coefficients of the 
distributed lag regression model using ordinary least squares (OLS). As discussed in 
Section 16.3, this estimator is consistent if the regression error has a conditional mean 
of 0 given current and past values of X, a condition that is referred to as exogeneity (as 
in Chapter 12). Because the omitted determinants of $Y_t$ are correlated over time (that
is, because they are serially correlated), the error term in the distributed lag model
can be serially correlated. This possibility in turn requires heteroskedasticity- and
autocorrelation-consistent (HAC) standard errors, the topic of Section 16.4.

A second way to estimate dynamic causal effects, discussed in Section 16.5, is to 
model the serial correlation in the error term as an autoregression and then to use this 
autoregressive model to derive an autoregressive distributed lag (ADL) model. Alterna- 
tively, the coefficients of the original distributed lag model can be estimated by general- 
ized least squares (GLS). Both the ADL and the GLS methods, however, require a stronger 
version of exogeneity than we have used so far: strict exogeneity, under which the regres- 
sion errors have a conditional mean of 0 given past, present, and future values of X. 

Section 16.6 provides a more complete analysis of the relationship between 
orange juice prices and the weather. In this application, the weather is exogenous 
(although, as discussed in Section 16.6, economic theory suggests that it is not
necessarily strictly exogenous). Because exogeneity is necessary for estimating dynamic
causal effects, Section 16.7 examines this assumption in several applications taken 
from macroeconomics and finance. 

This chapter builds on the material in Sections 15.1 through 15.4 but, with the 
exception of a subsection (that can be skipped) of the empirical analysis in Section 16.6, 
does not require the material in Sections 15.5 through 15.7. 


An Initial Taste of the Orange Juice Data 


Orlando, the historical center of Florida’s orange-growing region, is normally sunny 
and warm. But now and then there is a cold snap, and if temperatures drop below 
freezing for too long, the trees drop many of their oranges. If the cold snap is severe, 
the trees freeze. Following a freeze, the supply of orange juice concentrate falls, and 
its price rises. The timing of the price increases is rather complicated, however. 
Orange juice concentrate is a “durable,” or storable, commodity; that is, it can be 
stored in its frozen state, albeit at some cost (to run the freezer). Thus the price of 
orange juice concentrate depends not only on current supply but also on expecta- 
tions of future supply. A freeze today means that future supplies of concentrate will 
be low, but because concentrate currently in storage can be used to meet either cur- 
rent or future demand, the price of existing concentrate rises today. But precisely 
how much does the price of concentrate rise when there is a freeze? The answer to 
this question is of interest not just to orange juice traders but more generally to 
economists interested in studying the operations of commodity markets. To learn 
how the price of orange juice changes in response to weather conditions, we must 
analyze data on orange juice prices and the weather. 

Monthly data on the price of frozen orange juice concentrate, its monthly per- 
centage change, and temperatures in the orange-growing region of Florida from 
January 1950 to December 2000 are plotted in Figure 16.1. The price, plotted in Fig- 
ure 16.1a, is a measure of the average real price of frozen orange juice concentrate 
paid by wholesalers. This price was deflated by the overall producer price index for 
finished goods to eliminate the effects of overall price inflation. The percentage price 
change plotted in Figure 16.1b is the percentage change in the price over the month. 
The temperature data plotted in Figure 16.1c are the number of freezing degree days 
at the Orlando, Florida, airport, calculated as the sum of the number of degrees Fahr- 
enheit that the minimum temperature falls below freezing in a given day over all days 
in the month; for example, in November 1950 the airport temperature dropped below 
freezing twice, on the 25th (31°F) and on the 29th (29°F), for a total of 4 freezing
degree days [(32 − 31) + (32 − 29) = 4]. (The data are described in more detail
in Appendix 16.1.) As you can see by comparing the panels in Figure 16.1, the price 
of orange juice concentrate has large swings, some of which appear to be associated 
with cold weather in Florida. 
FIGURE 16.1 Orange Juice Prices and Florida Weather, 1950-2000
(c) Monthly freezing degree days in Orlando, Florida


There have been large month-to-month changes in the price of frozen concentrated orange juice. 
Many of the large movements coincide with freezing weather in Orlando, home of many orange groves. 


We begin our quantitative analysis of the relationship between orange juice price
and the weather by using a regression to estimate the amount by which orange juice
prices rise when the weather turns cold. The dependent variable is the percentage
change in the price over that month [$\%ChgP_t$, where $\%ChgP_t = 100 \times \Delta\ln(P_t^{OJ})$
and $P_t^{OJ}$ is the real price of orange juice]. The regressor is the number of freezing
degree days during that month ($FDD_t$). This regression is estimated using monthly
data from January 1950 to December 2000 (as are all regressions in this chapter), for
a total of $T = 612$ observations:

$$\widehat{\%ChgP_t} = \underset{(0.22)}{-0.40} + \underset{(0.13)}{0.47}\,FDD_t. \tag{16.1}$$


The standard errors reported in this section are not the usual OLS standard errors 
but rather are HAC standard errors that are appropriate when the error term and 
regressors are autocorrelated. HAC standard errors are discussed in Section 16.4, and 
for now, they are used without further explanation. 
According to this regression, an additional freezing degree day during a month
increases the price of orange juice concentrate over that month by 0.47%. In a month
with 4 freezing degree days, such as November 1950, the price of orange juice
concentrate is estimated to have increased by 1.88% ($4 \times 0.47\% = 1.88\%$) relative to
a month with no days below freezing.

Because the regression in Equation (16.1) includes only a contemporaneous 
measure of the weather, it does not capture any lingering effects of the cold snap on 
the orange juice price over the coming months. To capture these we need to consider 
the effect on prices of both contemporaneous and lagged values of FDD, which in 
turn can be done by augmenting the regression in Equation (16.1) with, for example, 
lagged values of FDD over the previous six months: 


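$$\begin{aligned}
\widehat{\%ChgP_t} = {} & \underset{(0.23)}{-0.65} + \underset{(0.14)}{0.47}\,FDD_t + \underset{(0.08)}{0.14}\,FDD_{t-1} + \underset{(0.06)}{0.06}\,FDD_{t-2} \\
& + \underset{(0.05)}{0.07}\,FDD_{t-3} + \underset{(0.03)}{0.03}\,FDD_{t-4} + \underset{(0.03)}{0.05}\,FDD_{t-5} + \underset{(0.04)}{0.05}\,FDD_{t-6}.
\end{aligned} \tag{16.2}$$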


Equation (16.2) is a distributed lag regression. The coefficient on $FDD_t$ in Equation
(16.2) estimates the percentage increase in prices over the course of the month in
which the freeze occurs; an additional freezing degree day is estimated to increase
prices that month by 0.47%. The coefficient on the first lag of $FDD_t$, $FDD_{t-1}$,
estimates the percentage increase in prices arising from a freezing degree day in the
preceding month, the coefficient on the second lag estimates the effect of a freezing
degree day two months ago, and so forth. Equivalently, the coefficient on the first lag
of $FDD$ estimates the effect of a unit increase in $FDD$ one month after the freeze
occurs. Thus the estimated coefficients in Equation (16.2) are estimates of the effect
of a unit increase in $FDD_t$ on current and future values of $\%ChgP$; that is, they are
estimates of the dynamic effect of $FDD_t$ on $\%ChgP_t$. For example, the 4 freezing
degree days in November 1950 are estimated to have increased orange juice prices
by 1.88% during November 1950, by an additional 0.56% ($= 4 \times 0.14$) in December
1950, by an additional 0.24% ($= 4 \times 0.06$) in January 1951, and so forth.
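The November 1950 arithmetic extends to every lag in Equation (16.2). The short sketch below multiplies each reported coefficient by the 4 freezing degree days; the variable names are illustrative, and the coefficients are copied from Equation (16.2).

```python
# Coefficients on FDD_t, FDD_{t-1}, ..., FDD_{t-6} as reported in Equation (16.2).
lag_coefficients = [0.47, 0.14, 0.06, 0.07, 0.03, 0.05, 0.05]
freezing_degree_days = 4  # November 1950

# Estimated effect on the monthly percentage price change, by months since the freeze.
effects = [round(freezing_degree_days * b, 2) for b in lag_coefficients]

for months_after, effect in enumerate(effects):
    print(f"{months_after} month(s) after the freeze: {effect:.2f}% estimated price increase")
```

The first three values reproduce the 1.88%, 0.56%, and 0.24% figures given in the text.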


Dynamic Causal Effects 


Before learning more about the tools for estimating dynamic causal effects, we 
should spend a moment thinking about what, precisely, is meant by a dynamic causal 
effect. Having a clear idea about what a dynamic causal effect is leads to a clearer 
understanding of the conditions under which it can be estimated. 


Causal Effects and Time Series Data 


Section 1.2 defined a causal effect as the outcome of an ideal randomized controlled 
experiment: When a horticulturalist randomly applies fertilizer to some tomato plots 
but not others and then measures the yield, the expected difference in yield between
the fertilized and unfertilized plots is the causal effect on tomato yield of the fertil- 
izer. This concept of an experiment, however, is one in which there are multiple 
subjects (multiple tomato plots or multiple people), so the data are either cross- 
sectional (the tomato yield at the end of the harvest) or panel data (individual 
incomes before and after an experimental job training program). By having multiple 
subjects, it is possible to have both treatment and control groups and thereby to esti- 
mate the causal effect of the treatment. 

In time series applications, this definition of causal effects in terms of an ideal 
randomized controlled experiment needs to be modified. To be concrete, consider an 
important problem of macroeconomics: estimating the effect of the central bank 
making an unanticipated change in the short-term interest rate on the current and 
future economic activity in a given country, as measured by gross domestic product 
(GDP). Taken literally, the randomized controlled experiment of Section 1.2 would entail 
randomly assigning different economies to treatment and control groups. The central 
banks in the treatment group would apply the treatment of a random interest rate change, 
while those in the control group would apply no such random changes; for both groups, 
economic activity (for example, GDP) would be measured over the next few years. But 
what if we are interested in estimating this effect for a specific country — say, the United 
States? Then this experiment would entail having different “clones” of the United States 
as subjects and assigning some clone economies to the treatment group and some to the 
control group. Obviously, this “parallel universes” experiment is infeasible. 

Instead, in time series data it is useful to think of a randomized controlled experi- 
ment as consisting of the same subject (e.g., the U.S. economy) being given different 
treatments (randomly chosen changes in interest rates) at different points in time 
(the 1970s, the 1980s, and so forth). In this framework, the single subject at different 
times plays the role of both treatment and control group: Sometimes the Fed changes 
the interest rate, while at other times it does not. Because data are collected over 
time, it is possible to estimate the dynamic causal effect — that is, the time path of the 
effect on the outcome of interest of the treatment. For example, a surprise increase 
in the short-term interest rate of 2 percentage points, sustained for one quarter, might 
initially have a negligible effect on output; after two quarters, GDP growth might 
slow, with the greatest slowdown after six quarters; then over the next 2 years, GDP 
growth might return to normal. This time path of causal effects is the dynamic causal 
effect on GDP growth of a surprise change in the interest rate. 

As a second example, consider the causal effect on orange juice price changes of a 
freezing degree day. It is possible to imagine a variety of hypothetical experiments, each 
yielding a different causal effect. One experiment would be to change the weather in 
the Florida orange groves, holding weather constant elsewhere —for example, holding 
weather constant in the Texas grapefruit groves and in other citrus fruit regions. This 
experiment would measure a partial effect, holding other weather constant. A second 
experiment might change the weather in all the regions, where the “treatment” is appli- 
cation of overall weather patterns. If weather is correlated across regions for competing 
crops, then these two dynamic causal effects differ. In this chapter, we consider the
causal effect in the latter experiment —that is, the causal effect of applying general 
weather patterns. This corresponds to measuring the dynamic effect on prices of a 
change in Florida weather, not holding weather constant in other agricultural regions. 


Dynamic effects and the distributed lag model. Because dynamic effects necessarily
occur over time, the econometric model used to estimate dynamic causal effects
needs to incorporate lags. To do so, $Y_t$ can be expressed as a distributed lag of current
and $r$ past values of $X_t$:

$$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + \beta_3 X_{t-2} + \cdots + \beta_{r+1} X_{t-r} + u_t, \tag{16.3}$$

where $u_t$ is an error term that includes the measurement error in $Y_t$ and the effect of
omitted determinants of $Y_t$. The model in Equation (16.3) is called the distributed lag
model relating $X_t$, and $r$ of its lags, to $Y_t$.

As an illustration of Equation (16.3), consider a modified version of the tomato/ 
fertilizer experiment: Because fertilizer applied today might remain in the ground in 
future years, the horticulturalist wants to determine the effect on tomato yield over 
time of applying fertilizer. Accordingly, she designs a three-year experiment and ran- 
domly divides her plots into four groups: The first is fertilized in only the first year; the 
second is fertilized in only the second year; the third is fertilized in only the third year; 
and the fourth, the control group, is never fertilized. Tomatoes are grown annually in each 
plot, and the third-year harvest is weighed. The three treatment groups are denoted by
the binary variables $X_{t-2}$, $X_{t-1}$, and $X_t$, where $t$ represents the third year (the year in
which the harvest is weighed), $X_{t-2} = 1$ if the plot is in the first group (fertilized two
years earlier), $X_{t-1} = 1$ if the plot was fertilized one year earlier, and $X_t = 1$ if the plot
was fertilized in the final year. In the context of Equation (16.3) (which applies to a single
plot), the effect of being fertilized in the final year is $\beta_1$, the effect of being fertilized one
year earlier is $\beta_2$, and the effect of being fertilized two years earlier is $\beta_3$. If the effect of
fertilizer is greatest in the year it is applied, then $\beta_1$ will be larger than $\beta_2$ and $\beta_3$.

More generally, the coefficient on the contemporaneous value of $X_t$, $\beta_1$, is the
contemporaneous or immediate effect of a unit change in $X_t$ on $Y_t$. The coefficient on
$X_{t-1}$, $\beta_2$, is the effect on $Y_t$ of a unit change in $X_{t-1}$ or, equivalently, the effect on $Y_{t+1}$
of a unit change in $X_t$; that is, $\beta_2$ is the effect of a unit change in $X$ on $Y$ one period
later. In general, the coefficient on $X_{t-h}$ is the effect of a unit change in $X$ on $Y$ after
$h$ periods. The dynamic causal effect is the effect of a change in $X_t$ on $Y_t$, $Y_{t+1}$, $Y_{t+2}$,
and so forth; that is, it is the sequence of causal effects on current and future values
of $Y$. Thus, in the context of the distributed lag model in Equation (16.3), the dynamic
causal effect is the sequence of coefficients $\beta_1, \beta_2, \ldots, \beta_{r+1}$.
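As a rough illustration of how the coefficients of Equation (16.3) can be recovered from data, the sketch below simulates a distributed lag model with $r = 2$ and estimates it by OLS. Everything numerical here is an assumption made for the example: the coefficient values, the sample size, and the choice of i.i.d. normal draws for both $X_t$ and $u_t$, which makes $X$ exogenous by construction.

```python
import numpy as np

rng = np.random.default_rng(42)
T, r = 5_000, 2
beta = np.array([0.5, 1.0, 0.6, 0.3])   # hypothetical beta_0, beta_1, beta_2, beta_3

x = rng.normal(size=T)                  # regressor, drawn independently of the error
u = rng.normal(size=T)                  # error term

# Regressor matrix with columns 1, X_t, X_{t-1}, ..., X_{t-r} for t = r, ..., T - 1.
lags = [x[r - h : T - h] for h in range(r + 1)]
X = np.column_stack([np.ones(T - r)] + lags)
y = X @ beta + u[r:]

# OLS estimates of the intercept and the distributed lag coefficients.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, est, true in zip(["beta_0", "beta_1", "beta_2", "beta_3"], beta_hat, beta):
    print(f"{name}: estimate {est:.3f}, value used in simulation {true:.1f}")
```

In this setup the estimates of $\beta_1$, $\beta_2$, and $\beta_3$ approximate the effect of a unit change in $X$ on $Y$ in the same period, one period later, and two periods later.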


Implications for empirical time series analysis. This formulation of dynamic causal 
effects in time series data as the expected outcome of an experiment in which differ- 
ent treatment levels are repeatedly applied to the same subject has two implications 
for empirical attempts to measure the dynamic causal effect with observational time 
series data. The first implication is that the dynamic causal effect should not change
over the sample on which we have data. This in turn is implied by the data being 
jointly stationary (Key Concept 15.3). As discussed in Section 15.7, the hypothesis 
that a population regression function is stable over time can be tested using the 
Quandt likelihood ratio (QLR) test for a break, and it is possible to estimate the 
dynamic causal effect in different subsamples. The second implication is that X must 
be uncorrelated with the error term, and it is to this implication that we now turn. 


Two Types of Exogeneity 


Section 12.1 defined an exogenous variable as a variable that is uncorrelated with the 
regression error term and an endogenous variable as a variable that is correlated with 
the error term. This terminology traces to models with multiple equations, in which 
an endogenous variable is determined within the model, while an exogenous variable 
is determined outside the model. Loosely speaking, if we are to estimate dynamic 
causal effects using the distributed lag model in Equation (16.3), the regressors (the 
X’s) must be uncorrelated with the error term. Thus X must be exogenous. Because 
we are working with time series data, however, we need to refine the definitions of 
exogeneity. In fact, there are two different concepts of exogeneity that we use here. 

The first concept of exogeneity is that the error term has a conditional mean of 0
given current and all past values of $X_t$; that is, $E(u_t \mid X_t, X_{t-1}, X_{t-2}, \ldots) = 0$.
This modifies the standard conditional mean assumption for multiple regression with
cross-sectional data (assumption 1 in Key Concept 6.4), which requires only that $u_t$
have a conditional mean of 0 given the included regressors; that is,
$E(u_t \mid X_t, X_{t-1}, \ldots, X_{t-r}) = 0$. Including all lagged values of $X_t$ in the conditional
expectation implies that all the more distant causal effects (all the causal effects
beyond lag $r$) are 0. Thus, under this assumption, the $r$ distributed lag coefficients in
Equation (16.3) constitute all the nonzero dynamic causal effects. We can refer to this
assumption, that $E(u_t \mid X_t, X_{t-1}, \ldots) = 0$, as past and present exogeneity, but
because of the similarity of this definition and the definition of exogeneity in
Chapter 12, we just use the term exogeneity.

The second concept of exogeneity is that the error term has mean 0 given all
past, present, and future values of $X_t$; that is,
$E(u_t \mid \ldots, X_{t+2}, X_{t+1}, X_t, X_{t-1}, X_{t-2}, \ldots) = 0$. This is called strict exogeneity; for clarity, we also call it past,
present, and future exogeneity. The reason for introducing the concept of strict exo- 
geneity is that, when X is strictly exogenous, there are more efficient estimators of 
dynamic causal effects than the OLS estimators of the coefficients of the distributed 
lag regression in Equation (16.3). 

The difference between exogeneity (past and present) and strict exogeneity 
(past, present, and future) is that strict exogeneity includes future values of X in the 
conditional expectation. Thus strict exogeneity implies exogeneity but not the 
reverse. One way to understand the difference between the two concepts is to consider
the implications of these definitions for correlations between $X$ and $u$. If $X$ is (past
and present) exogenous, then $u_t$ is uncorrelated with current and past values of $X_t$.
Key Concept 16.1: The Distributed Lag Model and Exogeneity
In the distributed lag model

$$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + \beta_3 X_{t-2} + \cdots + \beta_{r+1} X_{t-r} + u_t, \tag{16.4}$$

there are two different types of exogeneity, that is, two different exogeneity conditions:
• Past and present exogeneity (exogeneity):
$$E(u_t \mid X_t, X_{t-1}, X_{t-2}, \ldots) = 0; \text{ and} \tag{16.5}$$
• Past, present, and future exogeneity (strict exogeneity):
$$E(u_t \mid \ldots, X_{t+2}, X_{t+1}, X_t, X_{t-1}, X_{t-2}, \ldots) = 0. \tag{16.6}$$

If $X$ is strictly exogenous, it is exogenous, but exogeneity does not imply strict
exogeneity.


If $X$ is strictly exogenous, then in addition $u_t$ is uncorrelated with future values of $X_t$.
For example, if a change in $Y_t$ causes future values of $X_t$ to change, then $X_t$ is not
strictly exogenous even though it might be (past and present) exogenous.

As an illustration, consider the hypothetical multiyear tomato/fertilizer experiment 
described following Equation (16.3). Because the fertilizer is randomly applied in the hypo- 
thetical experiment, it is exogenous. Because tomato yield today does not depend on the 
amount of fertilizer applied in the future, the fertilizer time series is also strictly exogenous. 

As a second illustration, consider the orange juice price example, in which $Y_t$ is
the monthly percentage change in orange juice prices and $X_t$ is the number of freezing
degree days in that month. From the perspective of orange juice markets, we can
think of the weather (the number of freezing degree days) as if it were randomly
assigned in the sense that the weather is outside human control. If the effect of $FDD$
is linear and if it has no effect on prices after $r$ months, then it follows that the
weather is exogenous. But is the weather strictly exogenous? If the conditional mean
of $u_t$ given future $FDD$ is nonzero, then $FDD$ is not strictly exogenous. Answering
this question requires thinking carefully about what, precisely, is contained in $u_t$. In
particular, if orange juice market participants use forecasts of $FDD$ when they decide
how much they will buy or sell at a given price, then orange juice prices, and thus the
error term $u_t$, could incorporate information about future $FDD$ that would make $u_t$ a
useful predictor of $FDD$. This means that $u_t$ will be correlated with future values of
$FDD_t$. According to this logic, because $u_t$ includes forecasts of future Florida weather,
$FDD$ would be (past and present) exogenous but not strictly exogenous. The difference
between this and the tomato/fertilizer example is that, while tomato plants are unaffected
by future fertilization, orange juice market participants are influenced by forecasts
of future Florida weather. We return to the question of whether $FDD$ is strictly
exogenous when we analyze the orange juice price data in more detail in Section 16.6.

The two definitions of exogeneity are summarized in Key Concept 16.1. 
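The distinction between the two exogeneity conditions can be illustrated with a stylized simulation in the spirit of the orange juice example. It is a sketch under strong assumptions, not a model of the actual market: $X_t$ is drawn i.i.d., and the error term is constructed to contain a perfectly accurate forecast of next period's $X$, with a hypothetical weight of 0.5.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 100_000
x = rng.normal(size=T + 1)            # X_0, ..., X_T, i.i.d. for simplicity
noise = rng.normal(size=T)

# Hypothetical error term: u_t partly reflects a forecast of X_{t+1}.
u = 0.5 * x[1:] + noise               # u_t = 0.5 * X_{t+1} + noise_t, for t = 0, ..., T - 1

corr_current = np.corrcoef(u, x[:-1])[0, 1]     # corr(u_t, X_t)
corr_past = np.corrcoef(u[1:], x[:-2])[0, 1]    # corr(u_t, X_{t-1})
corr_future = np.corrcoef(u, x[1:])[0, 1]       # corr(u_t, X_{t+1})

print(f"corr(u_t, X_t)     = {corr_current:.3f}")
print(f"corr(u_t, X_(t-1)) = {corr_past:.3f}")
print(f"corr(u_t, X_(t+1)) = {corr_future:.3f}")
```

Here the error is essentially uncorrelated with current and past values of $X$, which is consistent with past and present exogeneity, but it is clearly correlated with the future value of $X$, so strict exogeneity fails.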


16.3 Estimation of Dynamic Causal Effects with Exogenous Regressors


If X is exogenous, then its dynamic causal effect on Y can be estimated by OLS esti- 
mation of the distributed lag regression in Equation (16.4). This section summarizes 
the conditions under which these OLS estimators lead to valid statistical inferences 
and introduces dynamic multipliers and cumulative dynamic multipliers. 


The Distributed Lag Model Assumptions 


The four assumptions of the distributed lag regression model are similar to the four 
assumptions for the cross-sectional multiple regression model (Key Concept 6.4), but 
they have been modified for time series data. 

The first assumption is that X is exogenous, which extends the 0 conditional 
mean assumption for cross-sectional data to include all lagged values of X. As dis- 
cussed in Section 16.2, this assumption implies that the r distributed lag coefficients 
in Equation (16.3) constitute all the nonzero dynamic causal effects. In this sense, the 
population regression function summarizes the entire dynamic effect on Y of a 
change in X. 

The second assumption has two parts: Part (a) requires that the variables have a 
stationary distribution, and part (b) requires that they become independently distrib- 
uted when the amount of time separating them becomes large. This assumption is the 
same as the corresponding assumption for the ADL model (the second assumption 
in Key Concept 15.6), and the discussion of that assumption in Section 15.4 applies 
here as well. 

The third assumption is that large outliers are unlikely, made mathematically 
precise by assuming that the variables have more than eight nonzero finite moments. 
This is stronger than the assumption of four finite moments that is used elsewhere in 
this text. As discussed in Section 16.4, this stronger assumption is used in the math- 
ematics behind the HAC variance estimator. 

The fourth assumption, which is the same as that in the cross-sectional multiple 
regression model, is that there is no perfect multicollinearity. 

The distributed lag regression model assumptions are summarized in Key Con- 
cept 16.2. 


Extension to additional X’s. The distributed lag model extends directly to multiple 
X’s: The additional X’s and their lags are simply included as regressors in the distrib- 
uted lag regression, and the assumptions in Key Concept 16.2 are modified to include 
these additional regressors. Although the extension to multiple X’s is conceptually 
straightforward, it complicates the notation, obscuring the main ideas of estimation 
and inference in the distributed lag model. For this reason, the case of multiple X’s 
is not treated explicitly in this chapter but is left as a straightforward extension of the 
distributed lag model with a single X.


KEY CONCEPT 16.2: The Distributed Lag Model Assumptions

The distributed lag model is given in Key Concept 16.1 [Equation (16.4)], where
$\beta_1, \beta_2, \ldots, \beta_{r+1}$ are dynamic causal effects and

1. X is exogenous; that is, $E(u_t \mid X_t, X_{t-1}, X_{t-2}, \ldots) = 0$;

2. (a) The random variables $Y_t$ and $X_t$ have a stationary distribution, and
   (b) $(Y_t, X_t)$ and $(Y_{t-j}, X_{t-j})$ become independent as $j$ gets large;

3. Large outliers are unlikely: $Y_t$ and $X_t$ have more than eight nonzero finite
   moments; and

4. There is no perfect multicollinearity.


Autocorrelated $u_t$, Standard Errors, and Inference


In the distributed lag regression model, the error term $u_t$ can be autocorrelated; that
is, $u_t$ can be correlated with its lagged values. This autocorrelation arises because, in
time series data, the omitted factors included in $u_t$ can themselves be serially correlated.
For example, suppose that the demand for orange juice also depends on
income, so one factor that influences the price of orange juice is income, specifically
the aggregate income of potential orange juice consumers. Then aggregate income is
an omitted variable in the distributed lag regression of orange juice price changes
against freezing degree days. Aggregate income, however, is serially correlated:
It tends to fall in recessions and rise in expansions. Because income is part of the
error term, $u_t$ will be serially correlated. This example is typical: Because omitted
determinants of Y are themselves serially correlated, in general $u_t$ in the distributed
lag model will be serially correlated.

The autocorrelation of $u_t$ does not affect the consistency of OLS, nor does it introduce
bias. If, however, the errors are autocorrelated, then, in general, the usual OLS
standard errors are inconsistent, and a different formula must be used. Thus serial
correlation of the errors is analogous to heteroskedasticity: Homoskedasticity-only
standard errors are “wrong” when the errors are, in fact, heteroskedastic, in the sense
that they result in misleading statistical inferences. Similarly, when the errors are
serially correlated, standard errors predicated on independently and identically
distributed (i.i.d.) errors are “wrong” in the sense that they result in misleading
statistical inferences. The solution to this problem is to use HAC standard errors,
the topic of Section 16.4.


Dynamic Multipliers and Cumulative Dynamic Multipliers 

Another name for the dynamic causal effect is the dynamic multiplier. The cumulative
dynamic multipliers are the cumulative causal effects, up to a given lag; thus the
cumulative dynamic multipliers measure the cumulative effect on Y of a change in X.

Dynamic multipliers. The effect of a unit change in X on Y after h periods, which is
$\beta_{h+1}$ in Equation (16.4), is called the h-period dynamic multiplier. Thus the dynamic
multipliers relating X to Y are the coefficients on $X_t$ and its lags in Equation (16.4). For
example, $\beta_2$ is the one-period dynamic multiplier, $\beta_3$ is the two-period dynamic multiplier,
and so forth. In this terminology, the zero-period (or contemporaneous) dynamic multiplier,
or impact effect, is $\beta_1$, the effect on Y of a change in X in the same period.
Because the dynamic multipliers are estimated by the OLS regression coefficients,
their standard errors are the HAC standard errors of the OLS regression coefficients.


Cumulative dynamic multipliers. The h-period cumulative dynamic multiplier is the
cumulative effect of a unit change in X on Y over the next h periods. Thus the cumulative
dynamic multipliers are the cumulative sum of the dynamic multipliers. In terms of
the coefficients of the distributed lag regression in Equation (16.4), the zero-period
cumulative multiplier is $\beta_1$, the one-period cumulative multiplier is $\beta_1 + \beta_2$, and the
h-period cumulative dynamic multiplier is $\beta_1 + \beta_2 + \cdots + \beta_{h+1}$. The sum of all the
individual dynamic multipliers, $\beta_1 + \beta_2 + \cdots + \beta_{r+1}$, is the cumulative long-run effect
on Y of a change in X and is called the long-run cumulative dynamic multiplier.

For example, consider the regression in Equation (16.2). The immediate effect of
an additional freezing degree day is that the price of orange juice concentrate rises
by 0.47%. The cumulative effect on prices over the next month is the sum of
the impact effect and the dynamic effect one month ahead; thus the cumulative effect
on prices is the initial increase of 0.47% plus the subsequent smaller increase of
0.14%, for a total of 0.61%. Similarly, the cumulative dynamic multiplier over two
months is 0.47% + 0.14% + 0.06% = 0.67%.
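
The arithmetic in this example is just a running sum of the dynamic multipliers. The short Python sketch below reproduces it; the list of multipliers is hard-coded from the numbers quoted above, and the variable names are our own illustrative choices.

```python
from itertools import accumulate

# Dynamic multipliers quoted in the text for the orange juice example
# (percentage price change per additional freezing degree day at lags 0, 1, 2).
dynamic_multipliers = [0.47, 0.14, 0.06]

# The h-period cumulative multiplier is the sum of the first h + 1 dynamic multipliers.
cumulative_multipliers = [round(c, 2) for c in accumulate(dynamic_multipliers)]

for h, c in enumerate(cumulative_multipliers):
    print(f"{h}-period cumulative dynamic multiplier: {c:.2f}%")
```

The last entry of the running sum plays the role of the long-run cumulative multiplier only if no further lags have nonzero effects, which is an assumption of this small example.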

The cumulative dynamic multipliers can be estimated directly using a modifica- 
tion of the distributed lag regression in Equation (16.4). This modified regression is 


$$Y_t = \delta_0 + \delta_1 \Delta X_t + \delta_2 \Delta X_{t-1} + \delta_3 \Delta X_{t-2} + \cdots + \delta_r \Delta X_{t-r+1} + \delta_{r+1} X_{t-r} + u_t. \tag{16.7}$$

The coefficients in Equation (16.7), $\delta_1, \delta_2, \ldots, \delta_{r+1}$, are, in fact, the cumulative
dynamic multipliers. This can be shown by a bit of algebra (Exercise 16.5), which
demonstrates that the population regressions in Equations (16.7) and (16.4) are
equivalent, where $\delta_0 = \beta_0$, $\delta_1 = \beta_1$, $\delta_2 = \beta_1 + \beta_2$, $\delta_3 = \beta_1 + \beta_2 + \beta_3$, and so forth.
The coefficient on $X_{t-r}$, $\delta_{r+1}$, is the long-run cumulative dynamic multiplier; that is,
$\delta_{r+1} = \beta_1 + \beta_2 + \beta_3 + \cdots + \beta_{r+1}$. Moreover, the OLS estimators of the coefficients
in Equation (16.7) are the same as the corresponding cumulative sum of the
OLS estimators in Equation (16.4). For example, $\hat\delta_2 = \hat\beta_1 + \hat\beta_2$. The main benefit of
estimating the cumulative dynamic multipliers using the specification in Equation (16.7)
is that, because the OLS estimators of the regression coefficients are
estimators of the cumulative dynamic multipliers, the HAC standard errors of the
coefficients in Equation (16.7) are the HAC standard errors of the cumulative
dynamic multipliers.
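
The claim that the OLS coefficients of Equation (16.7) equal the cumulative sums of the OLS coefficients of Equation (16.4) can be checked numerically. The Python sketch below does so on simulated data with r = 2; the data-generating values, the seed, and the small helper `ols` (normal equations solved by Gauss-Jordan elimination) are illustrative assumptions, not part of the text.

```python
import random

def ols(y, columns):
    """OLS coefficients (intercept first) from the normal equations X'X b = X'y."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for p in range(k):
        piv = max(range(p, k), key=lambda i: abs(a[i][p]))  # partial pivoting
        a[p], a[piv] = a[piv], a[p]
        for i in range(k):
            if i != p:
                f = a[i][p] / a[p][p]
                a[i] = [v - f * w for v, w in zip(a[i], a[p])]
    return [a[i][k] / a[i][i] for i in range(k)]

random.seed(0)
T, r = 400, 2
beta = [1.0, 0.5, 0.3, 0.1]  # hypothetical beta_0, beta_1, beta_2, beta_3
x = [random.gauss(0, 1) for _ in range(T)]
ts = range(r, T)             # observations for which all r lags exist
y = [beta[0] + sum(beta[j + 1] * x[t - j] for j in range(r + 1)) + random.gauss(0, 1)
     for t in ts]

# Equation (16.4): regress Y_t on X_t, X_{t-1}, X_{t-2}.
lags = [[x[t - j] for t in ts] for j in range(r + 1)]
beta_hat = ols(y, lags)

# Equation (16.7): regress Y_t on dX_t, dX_{t-1}, and X_{t-2}.
diffs = [[x[t - j] - x[t - j - 1] for t in ts] for j in range(r)]
delta_hat = ols(y, diffs + [lags[r]])

running = 0.0
for j in range(1, r + 2):
    running += beta_hat[j]
    print(f"delta_hat_{j} = {delta_hat[j]:.6f}   cumulative sum of beta_hat = {running:.6f}")
```

The two columns agree up to floating-point error because the regressors in the two specifications span the same space; only the parameterization differs.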
16.4 Heteroskedasticity- and Autocorrelation-Consistent Standard Errors


If the error term $u_t$ is autocorrelated, then OLS coefficient estimators are consistent, but
in general the usual OLS standard errors for cross-sectional data are not. This means that
conventional statistical inferences (hypothesis tests and confidence intervals) based on
the usual OLS standard errors will, in general, be misleading. For example, confidence
intervals constructed as the OLS estimator ± 1.96 conventional standard errors need not
contain the true value in 95% of repeated samples, even if the sample size is large. This
section begins with a derivation of the correct formula for the variance of the OLS
estimator with autocorrelated errors and then turns to HAC standard errors.

This section covers HAC standard errors for regression with time series data. 
Chapter 10 introduced a type of HAC standard errors, clustered standard errors, that 
are appropriate for panel data. Although clustered standard errors for panel data and 
HAC standard errors for time series data have the same goal, the different data 
structures lead to different formulas. This section is self-contained, and Chapter 10 is 
not a prerequisite. 


Distribution of the OLS Estimator 
with Autocorrelated Errors 


To keep things simple, consider the OLS estimator $\hat\beta_1$ in the distributed lag regression
model with no lags, that is, the linear regression model with a single regressor $X_t$:

$$Y_t = \beta_0 + \beta_1 X_t + u_t, \tag{16.8}$$

where the assumptions of Key Concept 16.2 are satisfied. This section shows that the
variance of $\hat\beta_1$ can be written as the product of two terms: the expression for $\mathrm{var}(\hat\beta_1)$,
applicable if $u_t$ is not serially correlated, multiplied by a correction factor that arises
from the autocorrelation in $u_t$ or, more precisely, the autocorrelation in $(X_t - \mu_X)u_t$.

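As shown in Appendix 4.3, the formula for the OLS estimator $\hat\beta_1$ in Key Concept 4.2
can be rewritten as

$$\hat\beta_1 = \beta_1 + \frac{\frac{1}{T}\sum_{t=1}^{T}(X_t - \bar X)u_t}{\frac{1}{T}\sum_{t=1}^{T}(X_t - \bar X)^2}, \tag{16.9}$$

where Equation (16.9) is Equation (4.28) with a change of notation so that i and n
are replaced by t and T. Because $\bar X \xrightarrow{p} \mu_X$ and $\frac{1}{T}\sum_{t=1}^{T}(X_t - \bar X)^2 \xrightarrow{p} \sigma_X^2$, in large
samples $\hat\beta_1 - \beta_1$ is approximately given by

$$\hat\beta_1 - \beta_1 \cong \frac{\frac{1}{T}\sum_{t=1}^{T}(X_t - \mu_X)u_t}{\sigma_X^2} = \frac{\frac{1}{T}\sum_{t=1}^{T}v_t}{\sigma_X^2} = \frac{\bar v}{\sigma_X^2}, \tag{16.10}$$

where $v_t = (X_t - \mu_X)u_t$ and $\bar v = \frac{1}{T}\sum_{t=1}^{T}v_t$. Thus

$$\mathrm{var}(\hat\beta_1) = \mathrm{var}\!\left(\frac{\bar v}{\sigma_X^2}\right) = \frac{\mathrm{var}(\bar v)}{(\sigma_X^2)^2}. \tag{16.11}$$

If $v_t$ is i.i.d., as assumed for cross-sectional data in Key Concept 4.3, then
$\mathrm{var}(\bar v) = \mathrm{var}(v_t)/T$, and the formula for the variance of $\hat\beta_1$ from Key Concept 4.4
applies. If, however, $u_t$ and $X_t$ are not independently distributed over time, then, in
general, $v_t$ will be serially correlated, so $\mathrm{var}(\bar v) \neq \mathrm{var}(v_t)/T$ and Key Concept 4.4
does not apply. Instead, if $v_t$ is serially correlated, the variance of $\bar v$ is given by

$$\begin{aligned}
\mathrm{var}(\bar v) &= \mathrm{var}[(v_1 + v_2 + \cdots + v_T)/T] \\
&= [\mathrm{var}(v_1) + \mathrm{cov}(v_1, v_2) + \cdots + \mathrm{cov}(v_1, v_T) \\
&\qquad + \mathrm{cov}(v_2, v_1) + \mathrm{var}(v_2) + \cdots + \mathrm{var}(v_T)]/T^2 \\
&= [T\,\mathrm{var}(v_t) + 2(T-1)\,\mathrm{cov}(v_t, v_{t-1}) \\
&\qquad + 2(T-2)\,\mathrm{cov}(v_t, v_{t-2}) + \cdots + 2\,\mathrm{cov}(v_t, v_{t-T+1})]/T^2 \\
&= \frac{\sigma_v^2}{T} f_T,
\end{aligned} \tag{16.12}$$

where

$$f_T = 1 + 2\sum_{j=1}^{T-1}\left(\frac{T-j}{T}\right)\rho_j, \tag{16.13}$$

where $\rho_j = \mathrm{corr}(v_t, v_{t-j})$. In large samples, $f_T$ tends to the limit
$f_T \to f_\infty = 1 + 2\sum_{j=1}^{\infty}\rho_j$.

Combining the expressions in Equation (16.10) for $\hat\beta_1$ and Equation (16.12) for
$\mathrm{var}(\bar v)$ gives the formula for the variance of $\hat\beta_1$ when $v_t$ is autocorrelated:

$$\mathrm{var}(\hat\beta_1) = \left[\frac{1}{T}\frac{\sigma_v^2}{(\sigma_X^2)^2}\right] f_T, \tag{16.14}$$

where $f_T$ is given in Equation (16.13).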

Equation (16.14) expresses the variance of $\hat\beta_1$ as the product of two terms. The
first, in square brackets, is the formula for the variance of $\hat\beta_1$ given in Key Concept 4.4,
which applies in the absence of serial correlation. The second is the factor $f_T$, which
adjusts this formula for serial correlation. Because of this additional factor $f_T$ in
Equation (16.14), the usual OLS standard error computed using Equation (5.4) is
incorrect if the errors are serially correlated: If $v_t = (X_t - \mu_X)u_t$ is serially correlated,
the estimator of the variance is off by the factor $f_T$.
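
To get a feel for the size of the factor $f_T$, it helps to evaluate Equation (16.13) for a specific pattern of autocorrelations. The Python sketch below assumes, purely for illustration, that $v_t$ has AR(1)-type autocorrelations $\rho_j = \rho^j$; under that assumption the large-sample limit is $f_\infty = 1 + 2\sum_{j \ge 1}\rho^j = (1+\rho)/(1-\rho)$. The function names are our own.

```python
def f_T(T, rho):
    """Factor in Equation (16.13), assuming corr(v_t, v_{t-j}) = rho**j."""
    return 1.0 + 2.0 * sum((T - j) / T * rho ** j for j in range(1, T))

def f_inf(rho):
    """Large-sample limit 1 + 2 * sum_{j>=1} rho**j = (1 + rho) / (1 - rho)."""
    return (1.0 + rho) / (1.0 - rho)

for rho in (0.0, 0.25, 0.5, 0.8):
    print(f"rho = {rho:4.2f}   f_T(T=1000) = {f_T(1000, rho):6.3f}   f_inf = {f_inf(rho):6.3f}")
```

With $\rho = 0.5$, for example, the limit is 3, so under this assumed autocorrelation pattern a standard error that ignores serial correlation would be too small by a factor of roughly $\sqrt{3} \approx 1.7$.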


HAC Standard Errors 


If the factor $f_T$, defined in Equation (16.13), were known, then the variance of $\hat\beta_1$
could be estimated by multiplying the usual cross-sectional estimator of the variance
by $f_T$. This factor, however, depends on the unknown autocorrelations of $v_t$, so it must
be estimated. The estimator of the variance of $\hat\beta_1$ that incorporates this adjustment is
consistent whether or not there is heteroskedasticity and whether or not $v_t$ is autocorrelated.
Accordingly, this estimator is called the heteroskedasticity- and autocorrelation-consistent
(HAC) estimator of the variance of $\hat\beta_1$, and the square root of the HAC
variance estimator is the HAC standard error of $\hat\beta_1$.


The HAC variance formula. The HAC estimator of the variance of $\hat\beta_1$ is

$$\tilde\sigma^2_{\hat\beta_1} = \hat\sigma^2_{\hat\beta_1}\hat f_T, \tag{16.15}$$

where $\hat\sigma^2_{\hat\beta_1}$ is the estimator of the variance of $\hat\beta_1$ in the absence of serial correlation,
given in Equation (5.4), and where $\hat f_T$ is an estimator of the factor $f_T$ in
Equation (16.13).

The task of constructing a consistent estimator $\hat f_T$ is challenging. To see why,
consider two extremes. At one extreme, given the formula in Equation (16.13), it
might seem natural to replace the population autocorrelations $\rho_j$ with the sample
autocorrelations $\hat\rho_j$ [defined in Equation (15.5)], yielding the estimator
$1 + 2\sum_{j=1}^{T-1}\left(\frac{T-j}{T}\right)\hat\rho_j$. But this estimator contains so many estimated autocorrelations
that it is inconsistent. Intuitively, because each of the estimated autocorrelations
contains an estimation error, by estimating so many autocorrelations the estimation
error in this estimator of $f_T$ remains large even in large samples. At the other
extreme, one could imagine using only a few sample autocorrelations, for example,
using only the first sample autocorrelation and ignoring all the higher autocorrelations.
Although this estimator eliminates the problem of estimating too many autocorrelations,
it has a different problem: It is inconsistent because it ignores the
additional autocorrelations that appear in Equation (16.13). In short, using too many
sample autocorrelations makes the estimator have a large variance, but using too few
autocorrelations ignores the autocorrelations at higher lags, so in either of these
extreme cases the estimator is inconsistent.

Estimators of $f_T$ used in practice strike a balance between these two extreme
cases by choosing the number of autocorrelations to include in a way that depends
on the sample size $T$. If the sample size is small, only a few autocorrelations are used,
but if the sample size is large, more autocorrelations are included (but still far fewer
than $T$). Specifically, let $\hat f_T$ be given by

$$\hat f_T = 1 + 2\sum_{j=1}^{m-1}\left(\frac{m-j}{m}\right)\tilde\rho_j, \tag{16.16}$$

where $\tilde\rho_j = \sum_{t=j+1}^{T}\hat v_t\hat v_{t-j} \big/ \sum_{t=1}^{T}\hat v_t^2$, where $\hat v_t = (X_t - \bar X)\hat u_t$ (as in the definition of
$\hat\sigma^2_{\hat\beta_1}$). The parameter $m$ in Equation (16.16) is called the truncation parameter of the
HAC estimator because the sum of autocorrelations is shortened, or truncated, to
include only $m - 1$ autocorrelations instead of the $T - 1$ autocorrelations appearing
in the population formula in Equation (16.13).

For $\hat f_T$ to be consistent, $m$ must be chosen so that it is large in large samples,
although still much less than $T$. One guideline for choosing $m$ in practice is to use the
formula

$$m = 0.75\,T^{1/3}, \tag{16.17}$$

rounded to an integer. This formula, which is based on the assumption that there is
at most a moderate amount of autocorrelation in $v_t$, gives a benchmark rule for
determining $m$ as a function of the number of observations in the regression.¹

The value of the truncation parameter $m$ resulting from Equation (16.17) can be
modified using your knowledge of the series at hand. On the one hand, if there is a
great deal of serial correlation in $v_t$, then you should increase $m$ beyond the value
from Equation (16.17). On the other hand, if $v_t$ has little serial correlation, you could
decrease $m$. Because of the ambiguity associated with the choice of $m$, it is good
practice to try one or two alternative values of $m$ for at least one specification to
make sure your results are not sensitive to $m$.

The HAC estimator in Equation (16.15), with $\hat f_T$ given in Equation (16.16), is
called the Newey–West variance estimator, after the econometricians Whitney
Newey and Kenneth West, who proposed it. They showed that, when used along with
a rule like that in Equation (16.17), under general assumptions this estimator is a
consistent estimator of the variance of $\hat\beta_1$ (Newey and West 1987). Their proofs (and
those in Andrews 1991) assume that $v_t$ has more than four moments, which in turn is
implied by $X_t$ and $u_t$ having more than eight moments, and this is the reason that the
third assumption in Key Concept 16.2 is that $X_t$ and $u_t$ have more than eight moments.
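
Equations (16.15) through (16.17) translate into a short computation for the single-regressor case. The Python sketch below is a minimal illustration using only the standard library; the simulated AR(1) data, the function names, and the degrees-of-freedom form used for the no-serial-correlation variance (our reading of Equation (5.4)) are assumptions made for the example, not a reference implementation.

```python
import math
import random

def truncation_parameter(T):
    """Benchmark rule m = 0.75 * T^(1/3) of Equation (16.17), rounded (at least 1)."""
    return max(1, round(0.75 * T ** (1.0 / 3.0)))

def newey_west_se(x, y, m=None):
    """Slope estimate, usual SE, HAC SE, and f_hat for Y_t = b0 + b1 X_t + u_t."""
    T = len(x)
    m = truncation_parameter(T) if m is None else m
    xbar, ybar = sum(x) / T, sum(y) / T
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    # v_hat_t = (X_t - Xbar) * u_hat_t, built from the OLS residuals.
    v = [(xi - xbar) * (yi - b0 - b1 * xi) for xi, yi in zip(x, y)]
    vv = sum(vt * vt for vt in v)
    # Variance estimator that ignores serial correlation (assumed form of Eq. (5.4)).
    var_usual = (1.0 / T) * (vv / (T - 2)) / (sxx / T) ** 2
    # f_hat_T from Equation (16.16): weighted sum of m - 1 sample autocorrelations.
    f_hat = 1.0
    for j in range(1, m):
        rho_j = sum(v[t] * v[t - j] for t in range(j, T)) / vv
        f_hat += 2.0 * (m - j) / m * rho_j
    return b1, math.sqrt(var_usual), math.sqrt(var_usual * f_hat), f_hat

# Illustrative data: X_t and u_t are independent AR(1) processes with coefficient 0.5.
random.seed(0)
T = 500
x, u = [0.0], [0.0]
for _ in range(T - 1):
    x.append(0.5 * x[-1] + random.gauss(0, 1))
    u.append(0.5 * u[-1] + random.gauss(0, 1))
y = [1.0 + 2.0 * xt + ut for xt, ut in zip(x, u)]

for m in (None, 3, 12):  # benchmark m, then two alternatives as a sensitivity check
    b1, se_usual, se_hac, f_hat = newey_west_se(x, y, m)
    label = truncation_parameter(T) if m is None else m
    print(f"m = {label:2d}   b1 = {b1:.3f}   usual SE = {se_usual:.4f}   "
          f"HAC SE = {se_hac:.4f}   f_hat = {f_hat:.3f}")
```

Looping over a few values of m mirrors the advice above to check that results are not sensitive to the truncation parameter. Setting m = 1 leaves the sum in Equation (16.16) empty, so the HAC standard error collapses to the usual one.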


Other HAC estimators. The Newey–West variance estimator is not the only HAC
estimator. For example, the weights $(m - j)/m$ in Equation (16.16) can be replaced
by different weights. If different weights are used, then the rule for choosing the
truncation parameter in Equation (16.17) no longer applies, and a different rule,
developed for those weights, should be used instead. Discussion of HAC estimators
using other weights goes beyond the scope of this text. For more information on this
topic, see Hayashi (2000, Section 6.6).


Extension to multiple regression. All the issues discussed in this section generalize
to the distributed lag regression model in Key Concept 16.1 with multiple lags and,
more generally, to the multiple regression model with serially correlated errors. In
particular, if the error term is serially correlated, then the usual OLS standard errors
are an unreliable basis for inference, and HAC standard errors should be used instead.
If the HAC variance estimator used is the Newey–West estimator [the HAC variance
estimator based on the weights $(m - j)/m$], then the truncation parameter $m$ can be
chosen according to the rule in Equation (16.17) whether there is a single regressor
or multiple regressors. The formula for HAC standard errors in multiple
regression is incorporated into modern regression software designed for use with
time series data. Because this formula involves matrix algebra, we omit it here
and instead refer the reader to Hayashi (2000, Section 6.6) for the mathematical
details.

HAC standard errors are summarized in Key Concept 16.3.

¹ Equation (16.17) gives the value of $m$ that minimizes $E(\tilde\sigma^2_{\hat\beta_1} - \sigma^2_{\hat\beta_1})^2$ when $u_t$ and $X_t$ are
first-order autoregressive processes with first autocorrelation coefficient 0.5. Equation (16.17)
is based on a more general formula derived by Andrews [1991, Equation (5.3)].


KEY CONCEPT 16.3: HAC Standard Errors

The problem: The error term $u_t$ in the distributed lag regression model in Key
Concept 16.1 can be serially correlated. If so, the OLS coefficient estimators are
consistent, but, in general, the usual OLS standard errors are not, resulting in
misleading hypothesis tests and confidence intervals.

The solution: Standard errors should be computed using a HAC estimator of
the variance. The HAC estimator involves estimates of $m - 1$ autocorrelations
as well as the variance; in the case of a single regressor, the relevant formulas are
given in Equations (16.15) and (16.16).

In practice, using HAC standard errors entails choosing the truncation
parameter $m$. To do so, use the formula in Equation (16.17) as a benchmark and
then increase or decrease $m$, depending on whether your regressors and errors
have high or low serial correlation.


16.5 Estimation of Dynamic Causal Effects with Strictly Exogenous Regressors


When $X_t$ is strictly exogenous, two alternative estimators of dynamic causal effects
are available. The first such estimator involves estimating an ADL model instead of
a distributed lag model and calculating the dynamic multipliers from the estimated
ADL coefficients. This method can entail estimating fewer coefficients than OLS
estimation of the distributed lag model, thus potentially reducing estimation error.
The second method is to estimate the coefficients of the distributed lag model, using
generalized least squares (GLS) instead of OLS. Although GLS estimates the same
number of coefficients in the distributed lag model as OLS, the GLS estimator has a
smaller variance. To keep the exposition simple, these two estimation methods are
laid out and discussed in the context of a distributed lag model with a single lag and
AR(1) errors. Appendix 16.2 extends these estimators to the general distributed lag
model with higher-order autoregressive errors.


The Distributed Lag Model with AR(1) Errors


Suppose that the causal effect on Y of a change in X lasts for only two periods; that
is, it has an initial impact effect $\beta_1$ and an effect in the next period of $\beta_2$ but no effect
thereafter. Then the appropriate distributed lag regression model is the distributed
lag model with only the current value $X_t$ and its first lag $X_{t-1}$:

$$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + u_t. \tag{16.18}$$

As discussed in Section 16.2, in general the error term $u_t$ in Equation (16.18) is serially
correlated. One consequence of this serial correlation is that, if the distributed lag
coefficients are estimated by OLS, then inference based on the usual OLS standard
errors can be misleading. For this reason, Sections 16.3 and 16.4 emphasized the use of
HAC standard errors when $\beta_1$ and $\beta_2$ in Equation (16.18) are estimated by OLS.

In this section, we take a different approach toward the serial correlation in $u_t$.
This approach, which is possible if $X_t$ is strictly exogenous, involves adopting an
autoregressive model for the serial correlation in $u_t$ and then using this AR model to
derive estimators that can be more efficient than OLS.

Specifically, suppose that $u_t$ follows the AR(1) model

$$u_t = \phi_1 u_{t-1} + \tilde u_t, \tag{16.19}$$

where $\phi_1$ is the autoregressive parameter, $\tilde u_t$ is serially uncorrelated, and no intercept
is needed because $E(u_t) = 0$. Equations (16.18) and (16.19) imply that the distributed
lag model with a serially correlated error can be rewritten as an autoregressive
distributed lag model with a serially uncorrelated error. To do so, lag each side of
Equation (16.18), and subtract $\phi_1$ multiplied by this lag from each side:


$$\begin{aligned}
Y_t - \phi_1 Y_{t-1} &= (\beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + u_t) - \phi_1(\beta_0 + \beta_1 X_{t-1} + \beta_2 X_{t-2} + u_{t-1}) \\
&= \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} - \phi_1\beta_0 - \phi_1\beta_1 X_{t-1} - \phi_1\beta_2 X_{t-2} + \tilde u_t,
\end{aligned} \tag{16.20}$$

where the second equality uses $\tilde u_t = u_t - \phi_1 u_{t-1}$. Collecting terms in Equation (16.20),
we have that

$$Y_t = \alpha_0 + \phi_1 Y_{t-1} + \delta_0 X_t + \delta_1 X_{t-1} + \delta_2 X_{t-2} + \tilde u_t, \tag{16.21}$$

where

$$\alpha_0 = \beta_0(1 - \phi_1),\quad \delta_0 = \beta_1,\quad \delta_1 = \beta_2 - \phi_1\beta_1,\quad \text{and}\quad \delta_2 = -\phi_1\beta_2, \tag{16.22}$$

where $\beta_0$, $\beta_1$, and $\beta_2$ are the coefficients in Equation (16.18) and $\phi_1$ is the autocorrelation
coefficient in Equation (16.19).

Equation (16.21) is an ADL model that includes a contemporaneous value of X
and two of its lags. We will refer to Equation (16.21) as the ADL representation of the
distributed lag model with autoregressive errors given in Equations (16.18) and (16.19).

The terms in Equation (16.20) can be reorganized differently to obtain an expression
that is equivalent to Equations (16.21) and (16.22). Let $\tilde Y_t = Y_t - \phi_1 Y_{t-1}$ be the
quasi-difference of $Y_t$ (quasi because it is not the first difference, the difference between $Y_t$ and
$Y_{t-1}$; rather, it is the difference between $Y_t$ and $\phi_1 Y_{t-1}$). Similarly, let $\tilde X_t = X_t - \phi_1 X_{t-1}$
be the quasi-difference of $X_t$. Then Equation (16.20) can be written

$$\tilde Y_t = \alpha_0 + \beta_1\tilde X_t + \beta_2\tilde X_{t-1} + \tilde u_t. \tag{16.23}$$

We will refer to Equation (16.23) as the quasi-difference representation of the
distributed lag model with autoregressive errors given in Equations (16.18) and (16.19).

The ADL model in Equation (16.21) [with the parameter restrictions in Equation (16.22)]
and the quasi-difference model in Equation (16.23) are equivalent. In
both models, the error term, $\tilde u_t$, is serially uncorrelated. The two representations,
however, suggest different estimation strategies. But before discussing those strategies,
we turn to the assumptions under which they yield consistent estimators of the
dynamic multipliers, $\beta_1$ and $\beta_2$.
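
The equivalence of the two representations can be verified numerically. The Python sketch below generates data from Equations (16.18) and (16.19) and then checks, period by period, that the same data satisfy both the ADL form (16.21) with the restrictions (16.22) and the quasi-difference form (16.23). All parameter values, the seed, and the function name `adl_coefficients` are hypothetical choices made for illustration.

```python
import random

def adl_coefficients(beta0, beta1, beta2, phi1):
    """Map the parameters of (16.18)-(16.19) to the ADL coefficients in (16.22)."""
    return {
        "alpha0": beta0 * (1.0 - phi1),
        "delta0": beta1,
        "delta1": beta2 - phi1 * beta1,
        "delta2": -phi1 * beta2,
    }

random.seed(1)
beta0, beta1, beta2, phi1 = 2.0, 0.5, 0.3, 0.6  # hypothetical parameter values
T = 200
x = [random.gauss(0, 1) for _ in range(T)]
u_tilde = [random.gauss(0, 1) for _ in range(T)]

# AR(1) errors, Equation (16.19).
u = [u_tilde[0]]
for t in range(1, T):
    u.append(phi1 * u[t - 1] + u_tilde[t])

# Distributed lag model, Equation (16.18); y[0] is a placeholder (no lag of X exists).
y = [None] + [beta0 + beta1 * x[t] + beta2 * x[t - 1] + u[t] for t in range(1, T)]

c = adl_coefficients(beta0, beta1, beta2, phi1)
max_gap = 0.0
for t in range(2, T):
    # Right-hand side of the ADL representation, Equation (16.21).
    adl_rhs = (c["alpha0"] + phi1 * y[t - 1] + c["delta0"] * x[t]
               + c["delta1"] * x[t - 1] + c["delta2"] * x[t - 2] + u_tilde[t])
    # Right-hand side of the quasi-difference representation, Equation (16.23).
    x_qd, x_qd_lag = x[t] - phi1 * x[t - 1], x[t - 1] - phi1 * x[t - 2]
    qd_rhs = c["alpha0"] + beta1 * x_qd + beta2 * x_qd_lag + u_tilde[t]
    max_gap = max(max_gap, abs(y[t] - adl_rhs), abs((y[t] - phi1 * y[t - 1]) - qd_rhs))

print("ADL coefficients:", c)
print("largest discrepancy across both representations:", max_gap)
```

The discrepancy is zero up to rounding error, which is the numerical counterpart of the algebra in Equation (16.20).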


The conditional mean 0 assumption in the ADL and quasi-difference models. Because
Equations (16.21) [with the restrictions in Equation (16.22)] and (16.23) are equivalent,
the conditions for their estimation are the same, so for convenience we consider
Equation (16.23).

The quasi-difference model in Equation (16.23) is a distributed lag model involving
the quasi-differenced variables with a serially uncorrelated error. Accordingly, the
conditions for OLS estimation of the coefficients in Equation (16.23) are the least
squares assumptions for the distributed lag model in Key Concept 16.2, expressed in
terms of $\tilde u_t$ and $\tilde X_t$. The critical assumption here is the first assumption, which, applied
to Equation (16.23), is that $\tilde X_t$ is exogenous; that is,

$$E(\tilde u_t \mid \tilde X_t, \tilde X_{t-1}, \ldots) = 0, \tag{16.24}$$

where letting the conditional expectation depend on distant lags of $\tilde X_t$ ensures that
no additional lags of $\tilde X_t$, other than those appearing in Equation (16.23), enter the
population regression function.

Because $\tilde X_t = X_t - \phi_1 X_{t-1}$, so $X_t = \tilde X_t + \phi_1 X_{t-1}$, conditioning on $\tilde X_t$ and all of
its lags is equivalent to conditioning on $X_t$ and all of its lags. Thus the conditional
expectation condition in Equation (16.24) is equivalent to the condition that
$E(\tilde u_t \mid X_t, X_{t-1}, \ldots) = 0$. Furthermore, because $\tilde u_t = u_t - \phi_1 u_{t-1}$, this condition in
turn implies that

$$\begin{aligned}
0 &= E(\tilde u_t \mid X_t, X_{t-1}, \ldots) \\
&= E(u_t - \phi_1 u_{t-1} \mid X_t, X_{t-1}, \ldots) \\
&= E(u_t \mid X_t, X_{t-1}, \ldots) - \phi_1 E(u_{t-1} \mid X_t, X_{t-1}, \ldots).
\end{aligned} \tag{16.25}$$

For the equality in Equation (16.25) to hold for general values of $\phi_1$, it must be
the case that both $E(u_t \mid X_t, X_{t-1}, \ldots) = 0$ and $E(u_{t-1} \mid X_t, X_{t-1}, \ldots) = 0$. By
shifting the time subscripts forward one time period, the condition that
$E(u_{t-1} \mid X_t, X_{t-1}, \ldots) = 0$ can be rewritten as

$$E(u_t \mid X_{t+1}, X_t, X_{t-1}, \ldots) = 0, \tag{16.26}$$

which (by the law of iterated expectations) implies that $E(u_t \mid X_t, X_{t-1}, \ldots) = 0$.
In summary, having the 0 conditional mean assumption in Equation (16.24) hold
for general values of $\phi_1$ is equivalent to having the condition in Equation (16.26)
hold.

The condition in Equation (16.26) is implied by $X_t$ being strictly exogenous, but it
is not implied by $X_t$ being (past and present) exogenous. Thus the least squares
assumptions for estimation of the distributed lag model in Equation (16.23) hold if $X_t$ is strictly
exogenous, but it is not enough that $X_t$ be (past and present) exogenous.

Because the ADL representation [Equations (16.21) and (16.22)] is equivalent
to the quasi-differenced representation [Equation (16.23)], the conditional mean
assumption needed to estimate the coefficients of the quasi-differenced representation
[that $E(u_t \mid X_{t+1}, X_t, X_{t-1}, \ldots) = 0$] is also the conditional mean assumption for
consistent estimation of the coefficients of the ADL representation.

We now turn to the two estimation strategies suggested by these two representations:
estimation of the ADL coefficients and estimation of the coefficients of the
quasi-difference model.


OLS Estimation of the ADL Model 


The first strategy is to use OLS to estimate the coefficients in the ADL model in 
Equation (16.21). As the derivation leading to Equation (16.21) shows, including the 
lag of Y and the extra lag of X as regressors makes the error term serially uncorre- 
lated (under the assumption that the error follows a first-order autoregression). Thus 
the usual OLS standard errors can be used; that is, HAC standard errors are not 
needed when the ADL model coefficients in Equation (16.21) are estimated by OLS. 

The estimated ADL coefficients are not themselves estimates of the dynamic
multipliers, but the dynamic multipliers can be computed from the ADL coefficients.
A general way to compute the dynamic multipliers is to express the estimated regression
function as a function of current and past values of $X_t$, that is, to eliminate $Y_t$
from the estimated regression function. To do so, repeatedly substitute expressions
for lagged values of $Y_t$ into the estimated regression function. Specifically, consider
the estimated regression function

$$\hat Y_t = \hat\phi_1 Y_{t-1} + \hat\delta_0 X_t + \hat\delta_1 X_{t-1} + \hat\delta_2 X_{t-2}, \tag{16.27}$$

where the estimated intercept has been omitted because it does not enter any expression
for the dynamic multipliers. Lagging both sides of Equation (16.27) yields
$\hat Y_{t-1} = \hat\phi_1 Y_{t-2} + \hat\delta_0 X_{t-1} + \hat\delta_1 X_{t-2} + \hat\delta_2 X_{t-3}$, so replacing $Y_{t-1}$ in Equation (16.27)
by this expression for $\hat Y_{t-1}$ and collecting terms yields

$$\begin{aligned}
\hat Y_t &= \hat\phi_1(\hat\phi_1 Y_{t-2} + \hat\delta_0 X_{t-1} + \hat\delta_1 X_{t-2} + \hat\delta_2 X_{t-3}) + \hat\delta_0 X_t + \hat\delta_1 X_{t-1} + \hat\delta_2 X_{t-2} \\
&= \hat\delta_0 X_t + (\hat\delta_1 + \hat\phi_1\hat\delta_0)X_{t-1} + (\hat\delta_2 + \hat\phi_1\hat\delta_1)X_{t-2} + \hat\phi_1\hat\delta_2 X_{t-3} + \hat\phi_1^2 Y_{t-2}.
\end{aligned} \tag{16.28}$$

Repeating this process by repeatedly substituting expressions for $Y_{t-2}$, $Y_{t-3}$, and so
forth yields

$$\begin{aligned}
\hat Y_t &= \hat\delta_0 X_t + (\hat\delta_1 + \hat\phi_1\hat\delta_0)X_{t-1} + (\hat\delta_2 + \hat\phi_1\hat\delta_1 + \hat\phi_1^2\hat\delta_0)X_{t-2} \\
&\quad + \hat\phi_1(\hat\delta_2 + \hat\phi_1\hat\delta_1 + \hat\phi_1^2\hat\delta_0)X_{t-3} + \hat\phi_1^2(\hat\delta_2 + \hat\phi_1\hat\delta_1 + \hat\phi_1^2\hat\delta_0)X_{t-4} + \cdots.
\end{aligned} \tag{16.29}$$

The coefficients in Equation (16.29) are the estimators of the dynamic multipliers,
computed from the OLS estimators of the coefficients in the ADL model in Equation (16.21).
If the restrictions on the coefficients in Equation (16.22) were to hold
exactly for the estimated coefficients, then the dynamic multipliers beyond the second
(that is, the coefficients on $X_{t-2}$, $X_{t-3}$, and so forth) would all be 0.² However,
under this estimation strategy those restrictions will not hold exactly, so the estimated
multipliers beyond the second in Equation (16.29) will generally be
nonzero.
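
The repeated substitution that leads to Equation (16.29) amounts to a simple recursion: the multiplier at horizon h equals the ADL coefficient on $X_{t-h}$ (taken to be 0 beyond the last included lag) plus $\hat\phi_1$ times the multiplier at horizon h − 1. The Python sketch below implements that recursion; the function name and the two sets of coefficient values are hypothetical and chosen only for illustration.

```python
def adl_dynamic_multipliers(phi1, deltas, horizon):
    """Dynamic multipliers implied by Y_t = phi1 * Y_{t-1} + sum_j deltas[j] * X_{t-j}.

    Recursion: mult[h] = delta_h + phi1 * mult[h - 1], with delta_h = 0 for lags
    of X beyond those included in the ADL model.
    """
    multipliers = []
    for h in range(horizon + 1):
        delta_h = deltas[h] if h < len(deltas) else 0.0
        previous = multipliers[-1] if multipliers else 0.0
        multipliers.append(delta_h + phi1 * previous)
    return multipliers

# Case 1: hypothetical unrestricted ADL estimates; multipliers beyond the second are nonzero.
unrestricted = adl_dynamic_multipliers(0.5, [0.4, 0.1, -0.1], horizon=4)
print("unrestricted:", [round(v, 4) for v in unrestricted])

# Case 2: coefficients that satisfy Equation (16.22) exactly for hypothetical
# beta_1 = 0.5, beta_2 = 0.3, phi_1 = 0.6.
beta1, beta2, phi1 = 0.5, 0.3, 0.6
restricted = adl_dynamic_multipliers(
    phi1, [beta1, beta2 - phi1 * beta1, -phi1 * beta2], horizon=4)
print("restricted:  ", [round(v, 4) for v in restricted])
```

In the second case the recursion returns $\beta_1$ and $\beta_2$ as the first two multipliers and zeros thereafter, which is the content of the restriction $\delta_2 + \phi_1\delta_1 + \phi_1^2\delta_0 = 0$.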


GLS Estimation 


The second strategy for estimating the dynamic multipliers when X, is strictly exogenous 
is to use generalized least squares (GLS), which entails estimating Equation (16.23). 
To describe the GLS estimator, we initially assume that ¢; is known. Because in 
practice it is unknown, this estimator is infeasible, so it is called the infeasible GLS 
estimator. The infeasible GLS estimator, however, can be modified using an estima- 
tor of $4, which yields a feasible version of the GLS estimator. 


Infeasible GLS. If ġı is known, then the quasi-differenced variables X, and Y, can 
be computed directly. As discussed in the context of Equations (16.24) and 
(16.26), if X, is strictly exogenous, then E(i,| Xp, X1, ...) = 0. Thus, if X, is 
strictly exogenous and if ¢; is known, the coefficients ap, 61, and f, in Equation 
(16.23) can be estimated by the OLS regression of Y, on X, and X,_, (including 
an intercept). The resulting estimator of £, and B)—that is, the OLS estimator of 
the slope coefficients in Equation (16.23) when ¢, is known—is the infeasible 
GLS estimator. This estimator is infeasible because in reality ¢, is unknown, so 
X, and Y, cannot be computed and thus these OLS estimators cannot actually be 
computed. 


Substitute the equalities in Equation (16.22) to show that, if those equalities hold, then 
5, + 15, + bdo = 0. 
Feasible GLS. The feasible GLS estimator modifies the infeasible GLS estimator by
using a preliminary estimator of $\phi_1$, $\hat{\phi}_1$, to compute the estimated quasi-differences.
Specifically, the feasible GLS estimators of $\beta_1$ and $\beta_2$ are the OLS estimators of $\beta_1$
and $\beta_2$ in Equation (16.23), computed by regressing $\hat{\tilde{Y}}_t$ on $\hat{\tilde{X}}_t$ and $\hat{\tilde{X}}_{t-1}$ (with an inter-
cept), where $\hat{\tilde{X}}_t = X_t - \hat{\phi}_1 X_{t-1}$ and $\hat{\tilde{Y}}_t = Y_t - \hat{\phi}_1 Y_{t-1}$.

The preliminary estimator, $\hat{\phi}_1$, can be computed by first estimating the distrib-
uted lag regression in Equation (16.18) by OLS and then using OLS to estimate $\phi_1$
in Equation (16.19) with the OLS residuals $\hat{u}_t$ replacing the unobserved regression
errors $u_t$. This version of the GLS estimator is called the Cochrane–Orcutt (1949)
estimator.

An extension of the Cochrane–Orcutt method is to continue this process itera-
tively: Use the GLS estimate of $\beta_1$ and $\beta_2$ to compute revised estimates of $u_t$; use
these new residuals to reestimate $\phi_1$; use this revised estimate of $\phi_1$ to compute
revised estimated quasi-differences; use these revised estimated quasi-differences to
reestimate $\beta_1$ and $\beta_2$; and continue this process until the estimates of $\beta_1$ and $\beta_2$ con-
verge. This is referred to as the iterated Cochrane–Orcutt estimator.
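
The feasible GLS steps just described can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `ols` and `cochrane_orcutt`, the simulated data, and the convergence tolerance are all hypothetical, and the model is taken to be the two-coefficient distributed lag $Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + u_t$ with AR(1) errors $u_t = \phi_1 u_{t-1} + \tilde{u}_t$.

```python
import numpy as np

def ols(y, X):
    """Return the OLS coefficients from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def cochrane_orcutt(y, x, max_iter=50, tol=1e-8):
    """Iterated Cochrane-Orcutt sketch (hypothetical helper).

    Model assumed: Y_t = b0 + b1*X_t + b2*X_{t-1} + u_t, u_t = phi1*u_{t-1} + e_t.
    Returns (array([b0, b1, b2]), phi1).
    """
    # Align the sample so that X_t and X_{t-1} are both available.
    Y, X0, X1 = y[1:], x[1:], x[:-1]
    Z = np.column_stack([np.ones_like(Y), X0, X1])
    beta = ols(Y, Z)          # initial step: OLS on the distributed lag regression
    phi = 0.0
    for _ in range(max_iter):
        u = Y - Z @ beta      # residuals of the untransformed regression
        # OLS of u_t on u_{t-1} (no intercept) estimates phi1.
        phi_new = (u[:-1] @ u[1:]) / (u[:-1] @ u[:-1])
        # Estimated quasi-differences; one more observation is lost here.
        Yq = Y[1:] - phi_new * Y[:-1]
        Zq = np.column_stack([
            np.ones(len(Yq)),
            X0[1:] - phi_new * X0[:-1],
            X1[1:] - phi_new * X1[:-1],
        ])
        coef = ols(Yq, Zq)
        # The transformed intercept equals b0 * (1 - phi1), so undo that scaling.
        beta_new = np.array([coef[0] / (1.0 - phi_new), coef[1], coef[2]])
        converged = (abs(phi_new - phi) < tol
                     and np.max(np.abs(beta_new - beta)) < tol)
        beta, phi = beta_new, phi_new
        if converged:
            break
    return beta, phi

# Simulated example (assumed values: b0 = 1, b1 = 0.5, b2 = 0.2, phi1 = 0.5).
rng = np.random.default_rng(0)
T = 2000
x = rng.standard_normal(T)
e = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t - 1] + e[t]
y = 1.0 + 0.5 * x + u
y[1:] += 0.2 * x[:-1]

beta_hat, phi_hat = cochrane_orcutt(y, x)
print(beta_hat, phi_hat)
```

Setting `max_iter=1` stops after a single pass, which corresponds to the (noniterated) Cochrane–Orcutt estimator; larger values continue until the estimates stop changing. In the simulation X is strictly exogenous by construction, which is the condition under which the text says GLS is appropriate.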


Efficiency of GLS. The virtue of the GLS estimator is that when X is strictly exoge-
nous and the transformed errors $\tilde{u}_t$ are homoskedastic, it is efficient among linear
estimators, at least in large samples. To see this, first consider the infeasible GLS
estimator. If $\tilde{u}_t$ is homoskedastic, if $\phi_1$ is known (so that $\tilde{X}_t$ and $\tilde{Y}_t$ can be treated as
if they are observed), and if $X_t$ is strictly exogenous, then the Gauss–Markov theo-
rem implies that the OLS estimator of $\alpha_0$, $\beta_1$, and $\beta_2$ in Equation (16.23) is efficient
among all linear conditionally unbiased estimators based on $\tilde{X}_t$ and $\tilde{Y}_t$, for
$t = 2, \ldots, T$, where the first observation ($t = 1$) is lost because of quasi-differencing.
That is, the OLS estimator of the coefficients in Equation (16.23) is the best linear unbi-
ased estimator, or BLUE (Section 5.5). Because the OLS estimator of Equation (16.23)
is the infeasible GLS estimator, this means that the infeasible GLS estimator is
BLUE. The feasible GLS estimator is similar to the infeasible GLS estimator except
that $\phi_1$ is estimated. Because the estimator of $\phi_1$ is consistent and its variance is
inversely proportional to T, the feasible and infeasible GLS estimators have the same
variances in large samples, and the loss of information from the first observation
($t = 1$) is negligible when T is large. In this sense, if X is strictly exogenous, then the
feasible GLS estimator is BLUE in large samples. In particular, if X is strictly exog-
enous, then GLS is more efficient than the OLS estimator of the distributed lag coef-
ficients discussed in Section 16.3.

The Cochrane–Orcutt and iterated Cochrane–Orcutt estimators presented here
are special cases of GLS estimation. In general, GLS estimation involves transform- 
ing the regression model so that the errors are homoskedastic and serially uncorre- 
lated and then estimating the coefficients of the transformed regression model by 
OLS. In general, the GLS estimator is consistent and BLUE in large samples if X is 
strictly exogenous, but it is not consistent if X is only (past and present) exogenous. 
The mathematics of GLS involves matrix algebra, so it is postponed to Section 19.6. 
16.6 Orange Juice Prices and Cold Weather


This section uses the tools of time series regression to squeeze additional insights 
from our data on Florida temperatures and orange juice prices. First, how long lasting 
is the effect of a freeze on the price? Second, has this dynamic effect been stable, or 
has it changed over the 51 years spanned by the data and, if so, how? 

We begin this analysis by estimating the dynamic causal effects using the method 
of Section 16.3—that is, by OLS estimation of the coefficients of a distributed lag 
regression of the percentage change in prices ($\%ChgP_t$) on the number of freezing
degree days in that month ($FDD_t$) and its lagged values. For the distributed lag esti-
mator to be consistent, FDD must be (past and present) exogenous. As discussed in
Section 16.2, this assumption is reasonable here. Humans cannot influence the 
weather, so treating the weather as if it were randomly assigned experimentally is 
appropriate as a working hypothesis (we return to this below). If FDD is exogenous, 
we can estimate the dynamic causal effects by OLS estimation of the coefficients in 
the distributed lag model of Equation (16.4) in Key Concept 16.1. 

As discussed in Sections 16.3 and 16.4, the error term can be serially correlated 
in distributed lag regressions, so it is important to use HAC standard errors, which 
adjust for this serial correlation. For the initial results, the truncation parameter for 
the Newey–West standard errors (m in the notation of Section 16.4) was chosen using
the rule in Equation (16.17): Because there are 612 monthly observations, according
to that rule $m = 0.75T^{1/3} = 0.75 \times 612^{1/3} = 6.37$, but because m must be an inte-
ger, this was rounded up to m = 7. The sensitivity of the standard errors to this
choice of truncation parameter is investigated below. 
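
The truncation-parameter arithmetic above can be checked in a few lines of Python. The function name `newey_west_truncation` is a hypothetical label for the rule $m = 0.75T^{1/3}$ cited in the text, with the result rounded up to an integer as described there.

```python
import math

def newey_west_truncation(T):
    """Rule of thumb m = 0.75 * T**(1/3); returns the raw value and its ceiling."""
    raw = 0.75 * T ** (1 / 3)
    return raw, math.ceil(raw)

raw_m, m = newey_west_truncation(612)
print(f"raw value = {raw_m:.2f}, truncation parameter m = {m}")
```

With T = 612 the raw value is about 6.37, so rounding up gives m = 7, the value used for the initial results.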

The results of OLS estimation of the distributed lag regression of $\%ChgP_t$ on $FDD_t$,
$FDD_{t-1}, \ldots, FDD_{t-18}$ are summarized in column (1) of Table 16.1. The coefficients of this
regression (only some of which are reported in the table) are estimates of the dynamic 
causal effect on orange juice price changes (in percent) for the first 18 months following 
a unit increase in the number of freezing degree days in a month. For example, a single 
freezing degree day is estimated to increase prices by 0.50% over the month in which the 
freezing degree day occurs. The subsequent effect on price in later months of a freezing 
degree day is less: After one month, the estimated effect is to increase the price by a further 
0.17%, and after two months, the estimated effect is to increase the price by an additional 
0.07%. The $R^2$ from this regression is 0.12, indicating that much of the monthly variation in
orange juice prices is not explained by current and past values of FDD. 

Plots of dynamic multipliers can convey information more effectively than tables 
such as Table 16.1. The dynamic multipliers from column (1) of Table 16.1 are plotted in 
Figure 16.2a along with their 95% confidence intervals, computed as the estimated coef-
ficient ± 1.96 HAC standard errors. After the initial sharp price rise, subsequent price
rises are less, although prices are estimated to rise slightly in each of the first six months 
after the freeze. As can be seen from Figure 16.2a, for months other than the first, the 
dynamic multipliers are not statistically significantly different from 0 at the 5% signifi- 
cance level, although they are estimated to be positive through the seventh month. 
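
As a rough check on the significance statement above, the 95% confidence intervals can be recomputed from the rounded estimates and HAC standard errors reported in column (1) of Table 16.1. The sketch below is illustrative only: the variable names are assumptions, and because the inputs are rounded to two decimals, the interval endpoints are approximate.

```python
# Rounded values from column (1) of Table 16.1 (selected lags only).
lags = [0, 1, 2, 3, 4, 5, 6, 12, 18]
multipliers = [0.50, 0.17, 0.07, 0.07, 0.02, 0.03, 0.03, -0.14, 0.00]
hac_se = [0.14, 0.09, 0.06, 0.04, 0.03, 0.03, 0.05, 0.08, 0.02]

def conf_int_95(estimate, se):
    """Estimated coefficient plus or minus 1.96 standard errors."""
    return estimate - 1.96 * se, estimate + 1.96 * se

significant = {}
for lag, b, se in zip(lags, multipliers, hac_se):
    lo, hi = conf_int_95(b, se)
    # A multiplier is significant at the 5% level if the interval excludes 0.
    significant[lag] = not (lo <= 0.0 <= hi)
    print(f"lag {lag:2d}: [{lo:7.4f}, {hi:7.4f}]  significant: {significant[lag]}")
```

Among the lags reported in the table, only the impact (lag 0) interval excludes 0, in line with the statement that the later multipliers are not individually significant at the 5% level.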
TABLE 16.1 The Dynamic Effect of a Freezing Degree Day (FDD) on the Price of Orange Juice:
Selected Estimated Dynamic Multipliers and Cumulative Dynamic Multipliers

                           (1)             (2)             (3)             (4)
                         Dynamic       Cumulative      Cumulative      Cumulative
Lag Number             Multipliers     Multipliers     Multipliers     Multipliers
0                         0.50            0.50            0.50            0.51
                         (0.14)          (0.14)          (0.14)          (0.15)
1                         0.17            0.67            0.67            0.70
                         (0.09)          (0.14)          (0.13)          (0.15)
2                         0.07            0.74            0.74            0.76
                         (0.06)          (0.17)          (0.16)          (0.18)
3                         0.07            0.81            0.81            0.84
                         (0.04)          (0.18)          (0.18)          (0.19)
4                         0.02            0.84            0.84            0.87
                         (0.03)          (0.19)          (0.19)          (0.20)
5                         0.03            0.87            0.87            0.89
                         (0.03)          (0.19)          (0.19)          (0.20)
6                         0.03            0.90            0.90            0.91
                         (0.05)          (0.20)          (0.21)          (0.21)
12                       -0.14            0.54            0.54            0.54
                         (0.08)          (0.27)          (0.28)          (0.28)
18                        0.00            0.37            0.37            0.37
                         (0.02)          (0.30)          (0.31)          (0.30)
Monthly indicators?        No              No              No             Yes
                                                                        F = 1.01
                                                                       (p = 0.43)
HAC standard error         7               7               14              7
truncation parameter (m)

All regressions were estimated by OLS using monthly data (described in Appendix 16.1) from January 1950 to
December 2000, for a total of T = 612 monthly observations. The dependent variable is the monthly percentage
change in the price of orange juice ($\%ChgP_t$). Regression (1) is the distributed lag regression with the monthly
number of freezing degree days and 18 of its lagged values (that is, $FDD_t, FDD_{t-1}, \ldots, FDD_{t-18}$), and the reported
coefficients are the OLS estimates of the dynamic multipliers. The cumulative multipliers are the cumulative sum of the
estimated dynamic multipliers. All regressions include an intercept, which is not reported. Newey–West HAC standard
errors, computed using the truncation number given in the final row, are reported in parentheses.




Column (2) of Table 16.1 contains the cumulative dynamic multipliers for this 
specification—that is, the cumulative sum of the dynamic multipliers reported in 
column (1). These cumulative dynamic multipliers are plotted in Figure 16.2b along 
with their 95% confidence intervals. After 1 month, the cumulative effect of the 
freezing degree day is to increase prices by 0.67%; after 2 months, the price is esti- 
mated to have risen by 0.74%; and after 6 months, the price is estimated to have risen 
by 0.90%. As can be seen in Figure 16.2b, these cumulative multipliers increase 
through the seventh month because the individual dynamic multipliers are positive for 
the first 7 months. In the eighth month, the dynamic multiplier is negative, so the price of
orange juice begins to fall slowly from its peak. After 18 months, the cumulative
increase in prices is only 0.37%; that is, the long-run cumulative dynamic multiplier is
only 0.37%. This long-run cumulative dynamic multiplier is not statistically signifi-
cantly different from 0 at the 10% significance level ($t = 0.37/0.30 = 1.23$).

FIGURE 16.2 The Dynamic Effect of a Freezing Degree Day (FDD) on the Price of Orange Juice
(b) Estimated cumulative dynamic multipliers and 95% confidence interval
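
The relationship between columns (1) and (2) of Table 16.1, and the t-statistic for the long-run multiplier, can be reproduced approximately from the rounded table entries. The sketch below uses NumPy; the variable names are assumptions. Small discrepancies of about 0.01 appear at lags 4 through 6, presumably because the published cumulative multipliers were computed from unrounded coefficients.

```python
import numpy as np

# Rounded dynamic multipliers for lags 0-6, column (1) of Table 16.1.
dynamic = np.array([0.50, 0.17, 0.07, 0.07, 0.02, 0.03, 0.03])
# Cumulative multipliers for the same lags as reported in column (2).
reported_cumulative = np.array([0.50, 0.67, 0.74, 0.81, 0.84, 0.87, 0.90])

# A cumulative multiplier is the running sum of the dynamic multipliers.
cumulative = np.cumsum(dynamic)
max_gap = float(np.max(np.abs(cumulative - reported_cumulative)))
print("running sums:", np.round(cumulative, 2))
print("largest gap versus the table:", round(max_gap, 3))

# t-statistic for the 18-month cumulative multiplier (estimate / HAC standard error).
t_long_run = 0.37 / 0.30
print("t-statistic:", round(t_long_run, 2))
```

The t-statistic of about 1.23 is below 1.645, the two-sided 10% critical value from the standard normal distribution, which is why the long-run multiplier is not significant at the 10% level.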


Sensitivity analysis. As in any empirical analysis, it is important to check whether 
these results are sensitive to changes in the details of the empirical analysis. We 
therefore examine three aspects of this analysis: sensitivity to the computation of the
HAC standard errors, an alternative specification that investigates potential omitted 
variable bias, and an analysis of the stability over time of the estimated multipliers. 

First, we investigate whether the standard errors reported in the second column 
of Table 16.1 are sensitive to different choices of the HAC truncation parameter m. 
In column (3), results are reported for m = 14, twice the value used in column (2). 
The regression specification is the same as in column (2), so the estimated coeffi- 
cients and dynamic multipliers are identical; only the standard errors differ but, as it 
happens, not by much. We conclude that the results are insensitive to changes in the 
HAC truncation parameter. 

Second, we investigate a possible source of omitted variable bias. Freezes in 
Florida are not randomly assigned throughout the year but rather occur in the winter 
(of course). If demand for orange juice is seasonal (is demand for orange juice greater 
in the winter than in the summer?), then the seasonal patterns in orange juice 
demand could be correlated with FDD, resulting in omitted variable bias. The quan- 
tity of oranges sold for juice is endogenous: Prices and quantities are simultaneously 
determined by the forces of supply and demand. Thus, as discussed in Section 9.2, 
including quantity would lead to simultaneity bias. Nevertheless, the seasonal com- 
ponent of demand can be captured by including seasonal variables as regressors. The 
specification in column (4) of Table 16.1 therefore includes 11 monthly binary vari- 
ables, one indicating whether the month is January, one indicating whether the month 
is February, and so forth (as usual, one binary variable must be omitted to prevent 
perfect multicollinearity with the intercept). These monthly indicator variables are 
not jointly statistically significant at the 10% level (p = 0.43), and the estimated 
cumulative dynamic multipliers are essentially the same as for the specifications 
excluding the monthly indicators. In summary, seasonal fluctuations in demand are 
not an important source of omitted variable bias. 
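
The remark about omitting one monthly binary variable can be illustrated numerically. The sketch below builds monthly indicators for 612 months starting in January and compares the column rank of the regressor matrix with 12 versus 11 indicators. The variable names, and the choice of December as the omitted month, are assumptions made for illustration; the text does not say which month was dropped.

```python
import numpy as np

T = 612
# Month index for each observation: 0 = January, ..., 11 = December.
month = np.arange(T) % 12

# T x 12 matrix of monthly binary variables.
all_12 = (month[:, None] == np.arange(12)).astype(float)
intercept = np.ones((T, 1))

# Including all 12 indicators plus an intercept: the indicators sum to the intercept.
X_trap = np.hstack([intercept, all_12])
# Omitting one indicator (December here, by assumption) removes the redundancy.
X_ok = np.hstack([intercept, all_12[:, :11]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(X_trap.shape[1], "columns, rank", rank_trap)
print(X_ok.shape[1], "columns, rank", rank_ok)
```

With all 12 indicators the matrix has 13 columns but rank 12, which is perfect multicollinearity; with 11 indicators it has 12 columns and full column rank.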


Have the dynamic multipliers been stable over time?³ To assess the stability of the
dynamic multipliers, we need to check whether the distributed lag regression coef-
ficients have been stable over time. Because we do not have a specific break date in
mind, we test for instability in the regression coefficients using the Quandt likelihood
ratio (QLR) statistic (Key Concept 15.9). The QLR statistic (with 15% trimming and
HAC variance estimator) testing the stability of all the coefficients in the regression
of column (1) has a value of 21.19, with q = 20 degrees of freedom (the coefficients
on $FDD_t$, its 18 lags, and the intercept). The 1% critical value in Table 15.5 is 2.43, so
the QLR statistic rejects at the 1% significance level. These QLR regressions have 
40 regressors, a large number; recomputing them for 6 lags only (so that there are 16 
regressors and q = 8) also results in rejection at the 1% level. Thus the hypothesis 
that the dynamic multipliers are stable is rejected at the 1% significance level. 
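
The mechanics of the QLR statistic can be sketched as follows. This is a simplified illustration, not the computation behind the numbers in the text: it uses the homoskedasticity-only F-statistic rather than the HAC variance estimator, it is applied to simulated data with a single regressor, and the function names `ssr` and `qlr_statistic` are hypothetical. The idea is to compute, for every candidate break date in the central 70% of the sample, the F-statistic testing that all coefficients are the same before and after that date, and then take the largest value.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def qlr_statistic(y, X, trim=0.15):
    """Largest break-test F-statistic over the central part of the sample.

    X must already contain the intercept column. Returns (QLR, break index).
    Uses the homoskedasticity-only F-statistic (an assumption of this sketch).
    """
    T, q = X.shape
    ssr_restricted = ssr(y, X)            # no break: one set of coefficients
    first = int(np.ceil(trim * T))
    last = int(np.floor((1 - trim) * T))
    best_f, best_tau = -np.inf, None
    for tau in range(first, last + 1):
        # D equals 1 from the candidate break date onward, 0 before it.
        D = (np.arange(T) >= tau).astype(float)[:, None]
        X_break = np.hstack([X, D * X])   # every coefficient may shift at tau
        ssr_unrestricted = ssr(y, X_break)
        f = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (T - 2 * q))
        if f > best_f:
            best_f, best_tau = f, tau
    return best_f, best_tau

# Simulated example: the slope changes from 0.5 to 2.0 at observation 150.
rng = np.random.default_rng(1)
T = 300
x = rng.standard_normal(T)
slope = np.where(np.arange(T) < 150, 0.5, 2.0)
y = 1.0 + slope * x + rng.standard_normal(T)
X = np.column_stack([np.ones(T), x])

qlr, tau_hat = qlr_statistic(y, X)
print(f"QLR = {qlr:.2f} at candidate break index {tau_hat}")
```

Because the QLR statistic is a maximum of many F-statistics, it must be compared with the special critical values in Table 15.5 rather than with the usual F critical values.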


³The discussion of stability in this subsection draws on material from Section 15.7 and can be skipped if
that material has not been covered. 
FIGURE 16.3 Estimated Cumulative Dynamic Multipliers from Different Sample Periods

One way to see how the dynamic multipliers have changed over time is to com-
pute them for different parts of the sample. Figure 16.3 plots the estimated cumula-
tive dynamic multipliers for the first third (1950-1966), middle third (1967-1983), and
final third (1984-2000) of the sample, computed by running separate regressions on 
each subsample. These estimates show an interesting and noticeable pattern. In the 1950s 
and early 1960s, a freezing degree day had a large and persistent effect on the price. The 
magnitude of the effect on price of a freezing degree day diminished in the 1970s, 
although it remained highly persistent. In the late 1980s and 1990s, the short-run 
effect of a freezing degree day was the same as in the 1970s, but it became much less 
persistent and was essentially eliminated after a year. These estimates suggest that 
the dynamic causal effect on orange juice prices of a Florida freeze became smaller 
and less persistent over the second half of the 20th century. The box “Orange Trees
on the March” discusses one possible explanation for the instability of the dynamic 
causal effects. 


ADL and GLS estimates. As discussed in Section 16.5, if the error term in the dis- 
tributed lag regression is serially correlated and FDD is strictly exogenous, it is 
possible to estimate the dynamic multipliers more efficiently than by OLS estima- 
tion of the distributed lag coefficients. Before using either the GLS estimator or 
the estimator based on the ADL model, however, we need to consider whether 
FDD is, in fact, strictly exogenous. True, humans cannot affect the daily weather, 
but does that mean that the weather is strictly exogenous? Does the error term $u_t$
in the distributed lag regression have conditional mean 0 given past, present, and 
future values of FDD? 
Orange Trees on the March


Why do the dynamic multipliers in Figure 16.3
vary over time? One possible explanation is 
changes in markets, but another is that the trees moved 
south. 

According to the Florida Department of Citrus, 
the severe freezes in the 1980s, which are visible in 
Figure 16.1c, spurred citrus growers to seek a warmer 
climate. As shown in Figure 16.4, the number of 
acres of orange trees in the more frost-prone north- 
ern and western counties fell from 232,000 acres 
in 1981 to 53,000 acres in 1985, and orange acre- 
age in southern and central counties subsequently 
increased from 413,000 in 1985 to 588,000 in 1993. 
With the groves farther south, northern frosts dam-
age a smaller fraction of the crop, and (as indicated
by the dynamic multipliers in Figure 16.3) price
becomes less sensitive to temperatures in the more
northern city of Orlando.

OK, the orange trees themselves might not have
been on the march (that can be left to Macbeth),
but southern migration of the orange groves does
give new meaning to the term nonstationarity.*


*The Florida orange juice industry has experienced many
other changes since the end of this data set in 2000. 
Demand for orange juice has declined, and imports from 
Brazil have increased. Perhaps most important has been 
the spread of a bacterial disease, citrus greening, that pre- 
vents oranges from maturing and kills citrus trees. Between 
2000 and 2015, total Florida orange production fell by 
approximately 60%. We are grateful to Professor James 
Cobbe of Florida State University for telling us about the 
southern movement of the orange groves. 
FIGURE 16.4 Orange Grove Acreage in Regions of Florida

The error term in the population counterpart of the distributed lag regression in
column (1) of Table 16.1 is the discrepancy between the price and its population 
prediction based on the past 18 months of weather. This discrepancy might arise for 
many reasons, one of which is that traders use forecasts of the weather in Orlando. 
NEWS FLASH: Commodity Traders Send Shivers Through Disney World


Although the weather at Disney World in
Orlando, Florida, is usually pleasant, now
and then a cold spell can settle in. If you are visiting
Disney World on a winter evening, should you bring 
a warm coat? Some people might check the weather 
forecast on TV, but those in the know can do better: 
They can check that day’s closing price on the New 
York orange juice futures market! 

The financial economist Richard Roll (1984) 
undertook a detailed study of the relationship 
between orange juice prices and the weather. He 
examined the effect on prices of cold weather in 
Orlando, but he also studied the “effect” of changes 
in the price of an orange juice futures contract (a 
contract to buy frozen orange juice concentrate at 
a specified date in the future) on the weather. Roll 
used daily data from 1975 to 1981 on the prices of 
orange juice futures contracts traded at the New 
York Cotton Exchange and on daily and overnight 
temperatures in Orlando. He found that a rise in the 
price of the futures contract during the trading day 
in New York predicted cold weather—in particu- 
lar, a freezing spell—in Orlando over the following 
night. In fact, the market was so effective in predict-
ing cold weather in Florida that a price rise during
the trading day actually predicted forecast errors in
the official U.S. government weather forecasts for 
that night. 

Roll’s study is also interesting for what he did not 
find: Although his detailed weather data explained 
some of the variation in daily orange juice futures 
prices, most of the daily movements in orange juice 
prices remained unexplained. He therefore sug-
gested that the orange juice futures market exhib-
its “excess volatility,” that is, more volatility than
can be attributed to movements in fundamentals. 
Understanding why (and if) there is excess volatil- 
ity in financial markets is now an important area of 
research in financial economics. 

Roll’s finding also illustrates the difference 
between forecasting and estimating dynamic causal 
effects. Price changes on the orange juice futures 
market are a useful predictor of cold weather, but 
that does not mean that commodity traders are so 
powerful that they can cause the temperature to 
fall. Visitors to Disney World might shiver after an 
orange juice futures contract price rise, but they are 
not shivering because of the price rise (unless, of
course, they went short in the orange juice futures
market).


For example, if an especially cold winter is forecasted, then traders would incorporate 
this into the price, so the price would be above its predicted value based on the popu- 
lation regression; that is, the error term would be positive. If this forecast is accurate, 
then, in fact, future weather would turn out to be cold. Thus future freezing degree 
days would be positive ($X_{t+1} > 0$) when the current price is unusually high ($u_t > 0$),
so $\mathrm{corr}(X_{t+1}, u_t)$ is positive. Stated more simply, although orange juice traders cannot
influence the weather, they can—and do—predict it (see the box, “NEWS FLASH: 
Commodity Traders Send Shivers Through Disney World”). Consequently, the error 
term in the price/weather regression is correlated with future weather. In other 
words, FDD is exogenous, but if this reasoning is true, it is not strictly exogenous, and 
the GLS and ADL estimators will not be consistent estimators of the dynamic mul- 
tipliers. These estimators therefore are not used in this application. 
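
The argument above can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: weather is drawn as independent noise, traders observe a noisy forecast of next period's weather, and that forecast enters the current error term with a weight of 0.3. None of these numbers come from the orange juice data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Stand-in for the weather series X_t (n + 1 draws so that X_{t+1} exists).
fdd = rng.standard_normal(n + 1)

# Traders' noisy forecast of next period's weather, X_{t+1} plus forecast noise.
forecast = fdd[1:] + 0.5 * rng.standard_normal(n)

# Error term u_t: the part of today's price not explained by current weather.
# It contains the forecast (priced in today) plus unrelated factors.
other_factors = rng.standard_normal(n)
u = 0.3 * forecast + other_factors

corr_present = np.corrcoef(u, fdd[:-1])[0, 1]   # corr(u_t, X_t)
corr_future = np.corrcoef(u, fdd[1:])[0, 1]     # corr(u_t, X_{t+1})
print(f"corr(u_t, X_t)     = {corr_present:.3f}")
print(f"corr(u_t, X_t+1)   = {corr_future:.3f}")
```

In this setup the error is essentially uncorrelated with current weather, so X is exogenous in the (past and present) sense, but it is clearly correlated with future weather, so X is not strictly exogenous.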
Is Exogeneity Plausible? Some Examples


As in regression with cross-sectional data, the interpretation of the coefficients in 
a distributed lag regression as causal dynamic effects hinges on the assumption that 
X is exogenous. If $X_t$ or its lagged values are correlated with $u_t$, then the conditional
mean of $u_t$ will depend on $X_t$ or its lags, in which case X is not (past and present)
exogenous. Regressors can be correlated with the error term for several reasons, 
but with economic time series data, a particularly important concern is that there 
could be simultaneous causality, which (as discussed in Sections 9.2 and 12.1) 
results in endogenous regressors. In Section 16.6, we discussed the assumptions 
of exogeneity and strict exogeneity of freezing degree days in detail. In this 
section, we examine the assumption of exogeneity in four other economic 
applications. 


U.S. Income and Australian Exports 


The United States is an important source of demand for Australian exports. Pre- 
cisely how sensitive Australian exports are to fluctuations in U.S. aggregate 
income could be investigated by regressing Australian exports to the United 
States against a measure of U.S. income. Strictly speaking, because the world 
economy is integrated, there is simultaneous causality in this relationship: A 
decline in Australian exports reduces Australian income, which reduces demand 
for imports from the United States, which reduces U.S. income. As a practical 
matter, however, this effect is very small because the Australian economy is much 
smaller than the U.S. economy. Thus U.S. income plausibly can be treated as exog- 
enous in this regression. 

In contrast, in a regression of European Union exports to the United States 
against U.S. income, the argument for treating U.S. income as exogenous is less 
convincing because demand by residents of the European Union for U.S. exports 
constitutes a substantial fraction of the total demand for U.S. exports. Thus a 
decline in U.S. demand for EU exports would decrease EU income, which in turn 
would decrease demand for U.S. exports and thus decrease U.S. income. Because 
of these linkages through international trade, EU exports to the United States and 
U.S. income are simultaneously determined, so in this regression U.S. income argu-
ably is not exogenous. This example illustrates a more general point that whether
a variable is exogenous depends on the context: U.S. income is plausibly exogenous 
in a regression explaining Australian exports but not in a regression explaining 
EU exports. 


Oil Prices and Inflation 


Ever since the oil price increases of the 1970s, macroeconomists have been interested 
in estimating the dynamic effect of an increase in the international price of crude oil 
on the U.S. rate of inflation. Because oil prices are set in world markets in large part
by foreign oil-producing countries, initially one might think that oil prices are exog- 
enous. But oil prices are not like the weather: Members of the Organization of 
Petroleum Exporting Countries set oil production levels strategically, taking many 
factors, including the state of the world economy, into account. To the extent that 
oil prices (or quantities) are set based on an assessment of current and future world 
economic conditions, including inflation in the United States, oil prices are 
endogenous. 


Monetary Policy and Inflation 


The central bankers in charge of monetary policy need to know the effect on infla- 
tion of monetary policy. Because an important tool of monetary policy is the short- 
term interest rate (the short rate), they need to know the dynamic causal effect on 
inflation of a change in the short rate. Although the short rate is determined by the 
central bank, it is not set by the central bankers at random (as it would be in an 
ideal randomized experiment); rather, it is set endogenously: The central bank 
determines the short rate based on an assessment of the current and future states 
of the economy, especially including the current and future rates of inflation. The 
rate of inflation in turn depends on the interest rate (higher interest rates reduce 
aggregate demand), but the interest rate depends on the rate of inflation, its past 
value, and its (expected) future value. Thus the short rate is endogenous, and the 
dynamic causal effect of a change in the short rate on future inflation cannot be 
consistently estimated by an OLS regression of the rate of inflation on current and 
past interest rates. 


The Growth Rate of GDP and the Term Spread 


In Chapter 15, lagged values of the term spread were used to forecast future values 
of the growth rate of GDP. Because lags of the term spread happened in the past, one 
might initially think that there cannot be feedback from current growth rates of 
GDP to past values of the term spread, so past values of the term spread can be 
treated as exogenous. But past values of the term spread were not randomly 
assigned in an experiment; instead, the past term spread was simultaneously deter- 
mined with past values of the growth rate of GDP. Because GDP and the interest 
rates making up the term spread are simultaneously determined, the other factors 
that determine the growth rate of GDP contained in $u_t$ are correlated with past
values of the term spread; that is, the term spread is not exogenous. It follows that 
the term spread is not strictly exogenous, so the dynamic multipliers computed 
using an ADL model [for example, the ADL model in Equation (15.20)] are not 
consistent estimates of the dynamic causal effect on the growth rate of GDP of a 
change in the term spread. 
Conclusion


Time series data provide the opportunity to estimate the time path of the effect on 
Y of a change in X—that is, the dynamic causal effect on Y of a change in X. To esti- 
mate dynamic causal effects using a distributed lag regression, however, X must be 
exogenous, as it would be if it were set randomly in an ideal randomized experiment. 
If X is not just exogenous but is strictly exogenous, then the dynamic causal effects 
can be estimated using an autoregressive distributed lag model or by GLS. 

In some applications, such as estimating the dynamic causal effect on the price 
of orange juice of freezing weather in Florida, a convincing case can be made that the 
regressor (freezing degree days) is exogenous; thus the dynamic causal effect can be 
estimated by OLS estimation of the distributed lag coefficients. Even in this applica- 
tion, however, economic theory suggests that the weather is not strictly exogenous, 
so the ADL and GLS methods are inappropriate. Moreover, in many relations of 
interest to econometricians, there is simultaneous causality, so the regressor in these 
specifications is not exogenous, strictly or otherwise. Ascertaining whether the 
regressor is exogenous (or strictly exogenous) ultimately requires combining eco- 
nomic theory, institutional knowledge, and careful judgment. 


Summary 


1. Dynamic causal effects in time series are defined in the context of a random- 
ized experiment, where the same subject (entity) receives different randomly 
assigned treatments at different times. The coefficients in a distributed lag 
regression of Y on X and its lags can be interpreted as the dynamic causal 
effects when the time path of X is determined randomly and independently of 
other factors that influence Y. 

2. The variable X is (past and present) exogenous if the conditional mean of the 
error $u_t$ in the distributed lag regression of Y on current and past values of X
does not depend on current and past values of X. If, in addition, the conditional
mean of $u_t$ does not depend on future values of X, then X is strictly exogenous.

3. If X is exogenous, then the OLS estimators of the coefficients in a distributed 
lag regression of Y on current and past values of X are consistent estimators of 
the dynamic causal effects. In general, the error $u_t$ in this regression is serially
correlated, so conventional standard errors are misleading and HAC standard 
errors must be used instead. 

4. If X is strictly exogenous, then the dynamic multipliers can be estimated using
either OLS estimation of an ADL model or GLS. 

5. Exogeneity is a strong assumption that often fails to hold in economic time 
series data because of simultaneous causality, and the assumption of strict exo- 
geneity is even stronger. 
Review the Concepts


16.1 In the 1970s, a common practice was to estimate a distributed lag model relat- 
ing changes in nominal GDP (Y) to current and past changes in the money 
supply (X). Under what assumptions will this regression estimate the causal 
effects of money on nominal GDP? Are these assumptions likely to be satis- 
fied in a modern economy like that of the United States? 


16.2 Suppose that X is strictly exogenous. A researcher estimates an ADL(1, 1) 
model, calculates the regression residual, and finds the residual to be highly 
serially correlated. Should the researcher estimate a new ADL model with 
additional lags or simply use HAC standard errors for the ADL(1, 1) esti- 
mated coefficients? 


16.3 Suppose that a distributed lag regression is estimated, where the dependent 
variable is $\Delta Y_t$ instead of $Y_t$. Explain how you would compute the dynamic
multipliers of $X_t$ on $Y_t$.


16.4 Suppose that you added $FDD_{t+1}$ as an additional regressor in Equation (16.2).
If FDD is strictly exogenous, would you expect the coefficient on $FDD_{t+1}$ to
be 0 or nonzero? Would your answer change if FDD is exogenous but not 
strictly exogenous? 
Exercises


16.1 Increases in oil prices have been blamed for several recessions in developed coun-
tries. To quantify the effect of oil prices on real economic activity, researchers have
run regressions like those discussed in this chapter. Let $GDP_t$ denote the value of
quarterly real GDP in the United States, and let $Y_t = 100\ln(GDP_t/GDP_{t-1})$ be
the quarterly percentage change in GDP. James Hamilton, an econometrician and
macroeconomist, has suggested that oil prices adversely affect that economy only
when they jump above their values in the recent past. Specifically, let $O_t$ equal the
greater of 0 or the percentage point difference between oil prices at date t and
their maximum value during the past three years. A distributed lag regression
relating $Y_t$ and $O_t$, estimated over 1960:Q1-2017:Q4, is

$$
\begin{aligned}
\hat{Y}_t = {} & \underset{(0.1)}{1.0} - \underset{(0.013)}{0.006}\,O_t - \underset{(0.011)}{0.014}\,O_{t-1} - \underset{(0.010)}{0.020}\,O_{t-2} - \underset{(0.009)}{0.024}\,O_{t-3} - \underset{(0.012)}{0.036}\,O_{t-4} \\
& - \underset{(0.007)}{0.013}\,O_{t-5} + \underset{(0.010)}{0.005}\,O_{t-6} - \underset{(0.008)}{0.007}\,O_{t-7} + \underset{(0.008)}{0.005}\,O_{t-8}.
\end{aligned}
$$

a. Suppose that oil prices jump 25% above their previous peak value and stay at
this new higher level (so that $O_t = 25$ and $O_{t+1} = O_{t+2} = \cdots = 0$). What
is the predicted effect on output growth for each quarter over the next two
years?

b. Construct a 95% confidence interval for your answers to (a).

c. What is the predicted cumulative change in GDP growth over eight quarters?

d. The HAC F-statistic testing whether the coefficients on $O_t$ and its lags
are 0 is 5.45. Are the coefficients significantly different from 0?


16.2 Macroeconomists have also noticed that interest rates change following oil
price jumps. Let $R_t$ denote the interest rate on three-month Treasury bills (in
percentage points at an annual rate). The distributed lag regression relating
the change in $R_t$ ($\Delta R_t$) to $O_t$, estimated over 1960:Q1-2017:Q4, is


AR, = 0.03 + 0.0130, + 0.0130,_, — 0.0040,_, — 0.0240,_3 — 0.0000,_4 


(0.05) (0.010) (0.010) (0.008) (0.015) (0.010) 
+ 0.0060,_5 — 0.0050,_ — 0.0180,_7 — 0.0040,_. 
(0.015) (0.015) (0.010) (0.006) 


a. Suppose that oil prices jump 25% above their previous peak value and stay 
at this new higher level (so that O, = 25 and O;,+1 = Oj12 = +++ = 0). 
What is the predicted change in interest rates for each quarter over the 
next two years? 


b. Construct 95% confidence intervals for your answers to (a). 
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c. What is the effect of this change in oil prices on the level of interest rates 
in period t + 8? How is your answer related to the cumulative multiplier? 


d. The HAC F-statistic testing whether the coefficients on O, and its lags 
are 0 is 1.92. Are the coefficients significantly different from 0? 


16.3 Consider two different randomized experiments. In experiment A, oil prices
are set randomly, and the central bank reacts according to its usual policy 
rules in response to economic conditions, including changes in the oil price. In 
experiment B, oil prices are set randomly, and the central bank holds interest 
rates constant and in particular does not respond to the oil price changes. In 
both experiments, GDP growth is observed. Now suppose that oil prices are 
exogenous in the regression in Exercise 16.1. To which experiment, A or B, 
does the dynamic causal effect estimated in Exercise 16.1 correspond? 


16.4 Suppose that oil prices are strictly exogenous. Discuss how you could improve
on the estimates of the dynamic multipliers in Exercise 16.1. 


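16.5 Derive Equation (16.7) from Equation (16.4), and show that $\delta_0 = \beta_0$,
$\delta_1 = \beta_1$, $\delta_2 = \beta_1 + \beta_2$, $\delta_3 = \beta_1 + \beta_2 + \beta_3$ (etc.).
(Hint: Note that $X_t = \Delta X_t + \Delta X_{t-1} + \cdots + \Delta X_{t-p+1} + X_{t-p}$.)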


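16.6 Consider the regression model $Y_t = \beta_0 + \beta_1 X_t + u_t$, where $u_t$ follows the
stationary AR(1) model $u_t = \phi_1 u_{t-1} + \tilde{u}_t$ with $\tilde{u}_t$ i.i.d. with mean 0 and
variance $\sigma_{\tilde{u}}^2$ and $|\phi_1| < 1$; the regressor $X_t$ follows the stationary AR(1) model
$X_t = \gamma_1 X_{t-1} + e_t$ with $e_t$ i.i.d. with mean 0 and variance $\sigma_e^2$ and $|\gamma_1| < 1$; and
$e_t$ is independent of $u_i$ for all $t$ and $i$.

a. Show that $\mathrm{var}(u_t) = \dfrac{\sigma_{\tilde{u}}^2}{1 - \phi_1^2}$ and $\mathrm{var}(X_t) = \dfrac{\sigma_e^2}{1 - \gamma_1^2}$.

b. Show that $\mathrm{cov}(u_t, u_{t-j}) = \phi_1^j\,\mathrm{var}(u_t)$ and $\mathrm{cov}(X_t, X_{t-j}) = \gamma_1^j\,\mathrm{var}(X_t)$.

c. Show that $\mathrm{corr}(u_t, u_{t-j}) = \phi_1^j$ and $\mathrm{corr}(X_t, X_{t-j}) = \gamma_1^j$.

d. Consider the terms $\sigma_v^2$ and $f_T$ in Equation (16.14).

i. Show that $\sigma_v^2 = \sigma_X^2 \sigma_u^2$, where $\sigma_X^2$ is the variance of $X$ and $\sigma_u^2$ is the
variance of $u$.

ii. Derive an expression for $f_\infty$.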
16.7 Consider the regression model $Y_t = \beta_0 + \beta_1 X_t + u_t$, where $u_t$ follows the
stationary AR(1) model $u_t = \phi_1 u_{t-1} + \tilde{u}_t$ with $\tilde{u}_t$ i.i.d. with mean 0 and variance
$\sigma_{\tilde{u}}^2$ and $|\phi_1| < 1$.

a. Suppose that $X_t$ is independent of $\tilde{u}_j$ for all $t$ and $j$. Is $X_t$ exogenous (past
and present)? Is $X_t$ strictly exogenous (past, present, and future)?

b. Suppose that $X_t = \tilde{u}_{t+1}$. Is $X_t$ exogenous? Is $X_t$ strictly exogenous?
16.8 Consider the model in Exercise 16.7 with $X_t = \tilde{u}_{t+1}$.

a. Is the OLS estimator of $\beta_1$ consistent? Explain.

b. Explain why the GLS estimator of $\beta_1$ is not consistent.

c. Show that the infeasible GLS estimator $\hat{\beta}_1^{GLS} \xrightarrow{p} \beta_1 - \dfrac{\phi_1}{1 + \phi_1^2}$.
[Hint: Apply the omitted variable formula in Equation (6.1) to the
quasi-differenced regression in Equation (16.23).]


16.9 Consider the constant-term-only regression model $Y_t = \beta_0 + u_t$, where $u_t$
follows the stationary AR(1) model $u_t = \phi_1 u_{t-1} + \tilde{u}_t$ with $\tilde{u}_t$ i.i.d. with mean 0
and variance $\sigma_{\tilde{u}}^2$ and $|\phi_1| < 1$.

a. Show that the OLS estimator is $\hat{\beta}_0 = T^{-1}\sum_{t=1}^{T} Y_t$.

b. Show that the (infeasible) GLS estimator is
$\hat{\beta}_0^{GLS} = (1 - \phi_1)^{-1}(T - 1)^{-1}\sum_{t=2}^{T}(Y_t - \phi_1 Y_{t-1})$.
[Hint: The GLS estimator of $\beta_0$ is $(1 - \phi_1)^{-1}$ multiplied
by the OLS estimator of $\alpha_0$ in Equation (16.23). Why?]

c. Show that $\hat{\beta}_0^{GLS}$ can be written as
$\hat{\beta}_0^{GLS} = (T - 1)^{-1}\sum_{t=2}^{T-1} Y_t + (1 - \phi_1)^{-1}(T - 1)^{-1}(Y_T - \phi_1 Y_1)$.
[Hint: Rearrange the formula in (b).]

d. Derive the difference $\hat{\beta}_0 - \hat{\beta}_0^{GLS}$, and discuss why it is likely to be small
when $T$ is large.


16.10 Consider the ADL model $Y_t = 5.3 + 0.2Y_{t-1} + 1.5X_t - 0.1X_{t-1} + u_t$, where
$X_t$ is strictly exogenous.

a. Derive the impact effect of $X$ on $Y$.

b. Derive the first five dynamic multipliers.

c. Derive the first five cumulative multipliers.

d. Derive the long-run cumulative dynamic multiplier.


16.11 Suppose that $a(L) = (1 - \phi L)$, with $|\phi| < 1$, and
$b(L) = 1 + \phi L + \phi^2 L^2 + \phi^3 L^3 + \cdots$.

a. Show that the product $b(L)a(L) = 1$, so that $b(L) = a(L)^{-1}$.

b. Why is the restriction $|\phi| < 1$ important?

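16.12 Suppose $Y_t = \beta_0 + u_t$, where $u_t$ follows a stationary AR(1)
$u_t = \phi_1 u_{t-1} + \tilde{u}_t$ with $\tilde{u}_t$ i.i.d. with mean 0 and variance $\sigma_{\tilde{u}}^2$ and $|\phi_1| < 1$.

a. Show that $\beta_0 = \mu_Y = E(Y_t)$.

b. Let $\bar{Y}_{1:T} = \frac{1}{T}\sum_{t=1}^{T} Y_t$ denote the sample mean of $Y_t$ using observations
from $t = 1$ through $t = T$. Show that the OLS estimator of $\beta_0$ is $\hat{\beta}_0 = \bar{Y}_{1:T}$.

c. Show that $\mathrm{var}[\sqrt{T}(\bar{Y}_{1:T} - \mu_Y)] \to \sigma_{\tilde{u}}^2/(1 - \phi_1)^2$.

d. Assume that $\bar{Y}_{1:T}$ is approximately normally distributed with mean $\mu_Y$
and variance $\sigma_{\tilde{u}}^2/[T(1 - \phi_1)^2]$. Suppose $T = 200$, $\sigma_{\tilde{u}}^2 = 7.9$, $\phi_1 = 0.3$,
and the sample mean of $Y_t$ is $\bar{Y}_{1:T} = 2.8$. Construct a 95% confidence
interval for $\mu_Y$.

e. Suppose you are interested in the average value of $Y_t$ from $t = T + 1$
through $T + h$; that is, $\bar{Y}_{T+1:T+h} = \frac{1}{h}\sum_{t=T+1}^{T+h} Y_t$, where $h$ is a large number.

i. Show that $\bar{Y}_{T+1:T+h}$ has mean $\mu_Y$ and variance $\sigma_{\tilde{u}}^2/[h(1 - \phi_1)^2]$.

ii. Assume that $\bar{Y}_{T+1:T+h}$ is approximately normally distributed. Suppose
$h = 100$, $\sigma_{\tilde{u}}^2 = 7.9$, $\phi_1 = 0.3$, and $\mu_Y = 2.9$. Construct a 95% forecast
interval for $\bar{Y}_{T+1:T+h}$.

f. Let $r = h/T$.

i. Show that $\mathrm{var}[\sqrt{T}(\bar{Y}_{T+1:T+h} - \bar{Y}_{1:T})] \to (1 + r^{-1})\dfrac{\sigma_{\tilde{u}}^2}{(1 - \phi_1)^2}$,
where $r$ is held fixed as $T \to \infty$.

ii. Show that $\bar{Y}_{T+1:T+h} - \bar{Y}_{1:T}$ has mean 0 and variance
$\left(\dfrac{1}{T} + \dfrac{1}{h}\right)\dfrac{\sigma_{\tilde{u}}^2}{(1 - \phi_1)^2}$.

iii. Use the result in (i) to show that the forecast interval
$\bar{Y}_{1:T} \pm 1.96\sqrt{\left(\dfrac{1}{T} + \dfrac{1}{h}\right)\dfrac{\sigma_{\tilde{u}}^2}{(1 - \phi_1)^2}}$
will contain the value of $\bar{Y}_{T+1:T+h}$ with probability 95%, approximately, when
$T$ and $h$ are large. (Assume that $\bar{Y}_{T+1:T+h} - \bar{Y}_{1:T}$ is approximately
normally distributed.)

iv. Suppose $T = 200$, $h = 100$, $\sigma_{\tilde{u}}^2 = 7.9$, $\phi_1 = 0.3$, and $\bar{Y}_{1:T} = 2.8$. Construct
a 95% forecast interval for $\bar{Y}_{T+1:T+h}$.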


Empirical Exercises 


E16.1 In this exercise, you will estimate the effect of oil prices on macroeconomic activity
using monthly data on the Index of Industrial Production (IP) and the monthly
measure of $O_t$ described in Exercise 16.1. The data can be found on the text website,
http://www.pearsonglobaleditions.com, in the file USMacro_Monthly.

a. Compute the monthly growth rate in IP, expressed in percentage points,
$ip\_growth_t = 100 \times \ln(IP_t/IP_{t-1})$. What are the mean and standard
deviation of ip_growth over the 1960:M1-2017:M12 sample period?
What are the units for ip_growth (percent, percent per annum, percent
per month, or something else)?

b. Plot the value of $O_t$. Why are so many values of $O_t$ equal to 0? Why
aren't some values of $O_t$ negative?

c. Estimate a distributed lag model by regressing ip_growth onto the
current value and 18 lagged values of $O_t$, including an intercept. What
value of the HAC standard error truncation parameter $m$ did you
choose? Why?

d. Taken as a group, are the coefficients on $O_t$ statistically significantly
different from 0?

e. Construct graphs like those in Figure 16.2, showing the estimated
dynamic multipliers, cumulative multipliers, and 95% confidence
intervals. Comment on the real-world size of the multipliers.

f. Suppose that high demand in the United States (evidenced by large
values of ip_growth) leads to increases in oil prices. Is $O_t$ exogenous? Are
the estimated multipliers shown in the graphs in (e) reliable? Explain.

E16.2 In the data file USMacro_Quarterly, you will find data on two aggregate
price series for the United States: the price index for personal consumption
expenditures (PCEP), which you used in Empirical Exercise 15.1, and the
Consumer Price Index (CPI). These series are alternative measures of consumer
prices in the United States. The CPI prices a basket of goods whose composition
is updated every 5-10 years. The PCEP uses chain weighting to price a basket
of goods whose composition changes from month to month. Economists
have argued that the CPI will overstate inflation because it does not take into
account the substitution that occurs when relative prices change. If this
substitution bias is important, then average CPI inflation should be systematically
higher than PCEP inflation. Let $\pi_t^{CPI} = 400 \times [\ln(CPI_t) - \ln(CPI_{t-1})]$,
$\pi_t^{PCEP} = 400 \times [\ln(PCEP_t) - \ln(PCEP_{t-1})]$, and $Y_t = \pi_t^{CPI} - \pi_t^{PCEP}$, so
$\pi_t^{CPI}$ is the quarterly rate of price inflation (measured in percentage points at
an annual rate) based on the CPI, $\pi_t^{PCEP}$ is the quarterly rate of price inflation
from the PCEP, and $Y_t$ is their difference. Using data from 1963:Q1 through
2017:Q4, carry out the following exercises.

a. Compute the sample means of $\pi_t^{CPI}$ and $\pi_t^{PCEP}$. Are these point estimates
consistent with the presence of economically significant substitution bias in
the CPI?

b. Compute the sample mean of $Y_t$. Explain why it is numerically equal to
the difference in the means computed in (a).

c. Show that the population mean of $Y$ is equal to the difference of the
population means of the two inflation rates.

d. Consider the constant-term-only regression $Y_t = \beta_0 + u_t$. Show that
$\beta_0 = E(Y)$. Do you think that $u_t$ is serially correlated? Explain.

e. Construct a 95% confidence interval for $\beta_0$. What value of the HAC
standard error truncation parameter $m$ did you choose? Why?

f. Is there statistically significant evidence that the mean inflation rate for
the CPI is greater than the rate for the PCEP?

g. Is there evidence of instability in $\beta_0$? Carry out a QLR test. (Hint: Make sure
you use HAC standard errors for the regressions in the QLR procedure.)


E16.3 In the data file USMacro_Quarterly, you will find the data on U.S. real GDP
(GDPC1) that was analyzed in Chapter 15. In this exercise, you will construct
a 95% confidence interval for the mean growth rate of real GDP in the United
States; in addition, you will construct a 95% forecast interval for the average
growth rate of real GDP for 2018:Q1-2067:Q4. Before attempting this empirical
exercise, you should answer Exercise 16.12.

a. Compute the growth rate of real GDP: $Y_t = 400 \times [\ln(GDPC1_t) - \ln(GDPC1_{t-1})]$.
Plot the series from 1960 through 2017, and verify that
the data are the same as plotted in Figure 15.1b.

b. Using the data from 1960:Q1 through 2017:Q4:

i. Estimate an AR(1) model for $Y_t$. In the notation of Exercise 16.12,
denote the estimated AR(1) coefficient by $\hat{\phi}_1$ and the standard error
of the regression as $\hat{\sigma}_{\tilde{u}}$.

ii. Compute the sample mean of $Y_t$.

c. Assuming that $Y_t$ follows an AR(1), use the results you derived in
Exercise 16.12, the estimated values of $\phi_1$ and $\sigma_{\tilde{u}}$ from (b.i), and the sample
mean from (b.ii) to

i. Construct a 95% confidence interval for $\mu_Y$, the mean growth rate of
real GDP.

ii. Construct a 95% forecast interval for the average growth rate of real
GDP over the period 2018:Q1-2067:Q4, that is, for $\bar{Y}_{2018:Q1:2067:Q4}$.

d. Using the data from 1960:Q1 through 2017:Q4:

i. Regress $Y_t$ on a constant (with no other regressors). Construct the
standard error for the estimated constant using the Newey-West
HAC estimator with four lags.

ii. Use the results from this regression to construct a 95% confidence
interval for $\mu_Y$, the mean growth rate of real GDP.

iii. Use the results from this regression to construct a 95% forecast
interval for the average growth rate of real GDP over the period
2018:Q1-2067:Q4, that is, for $\bar{Y}_{2018:Q1:2067:Q4}$.

e. Are the intervals constructed in (d.ii) and (d.iii) similar to the intervals
constructed in (c.i) and (c.ii)? Should they be? Explain.


APPENDIX 


16.1 The Orange Juice Data Set 


The orange juice price data are the frozen orange juice component of the processed foods and feeds
group of the Producer Price Index (PPI), collected by the U.S. Bureau of Labor Statistics (BLS Series
wpu02420301). The orange juice price series was divided by the overall PPI for finished goods to
adjust for general price inflation. The freezing degree days series was constructed from daily minimum
temperatures recorded at Orlando-area airports, obtained from the National Oceanic and
Atmospheric Administration (NOAA) of the U.S. Department of Commerce. The FDD series was
constructed so that its timing and the timing of the orange juice price data were approximately
aligned. Specifically, the frozen orange juice price data are collected by surveying a sample of producers
in the middle of every month, although the exact date varies from month to month. Accordingly,
the FDD series was constructed to be the number of freezing degree days from the 11th of one month
to the 10th of the next month; that is, FDD is the maximum of 0 and 32 minus the minimum daily
temperature, summed over all days from the 11th to the 10th. Thus $\%ChgP_t$ for February is the
percentage change in real orange juice prices from mid-January to mid-February, and $FDD_t$ for February
is the number of freezing degree days from January 11 to February 10.
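
The FDD construction described above can be illustrated with a minimal Python sketch. The function name and the temperatures below are hypothetical and are not the NOAA data; only the formula (the maximum of 0 and 32 minus the daily minimum temperature, summed over the days in the window) is taken from the text.

```python
def freezing_degree_days(min_temps_f):
    """Sum of max(0, 32 - daily minimum temperature) over the days supplied.

    min_temps_f: daily minimum temperatures in degrees Fahrenheit for the
    window of interest (for example, the 11th of one month through the
    10th of the next).
    """
    return sum(max(0.0, 32.0 - temp) for temp in min_temps_f)


# Hypothetical daily minimums for illustration only (not actual data).
example_minimums = [40.0, 31.0, 28.0, 35.0, 32.0]
example_fdd = freezing_degree_days(example_minimums)
print(example_fdd)  # 0 + 1 + 4 + 0 + 0 = 5.0
```

Days with a minimum at or above 32 degrees contribute nothing, so a month without a freeze has FDD equal to 0.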
The ADL Model and Generalized Least Squares in Lag Operator Notation


Section 16.5 introduced the autoregressive distributed lag model for the case that the error
term in the distributed lag model is AR(1). This appendix extends the ADL model to the case
of AR(p) errors, using the lag operator notation introduced in Appendix 15.3.


The Distributed Lag, ADL, and Quasi-Difference Models in Lag Operator Notation


As defined in Appendix 15.3, the lag operator, $L$, has the property that $L^j X_t = X_{t-j}$, and the
distributed lag $\beta_1 X_t + \beta_2 X_{t-1} + \cdots + \beta_{r+1} X_{t-r}$ can be expressed as $\beta(L)X_t$, where
$\beta(L) = \sum_{j=0}^{r} \beta_{j+1} L^j$, where $L^0 = 1$. Thus the distributed lag model in Key Concept 16.1
[Equation (16.4)] can be written in lag operator notation as

$$Y_t = \beta_0 + \beta(L)X_t + u_t. \tag{16.30}$$

In addition, if the error term $u_t$ follows an AR(p), then it can be written as

$$\phi(L)u_t = \tilde{u}_t, \tag{16.31}$$

where $\phi(L) = \sum_{j=0}^{p} \phi_j L^j$, where $\phi_0 = 1$, and $\tilde{u}_t$ is serially uncorrelated [note that, in the case
$p = 1$, $\phi_1$ as defined here is the negative of $\phi_1$ in the notation of Equation (16.19)].
To derive the ADL model, premultiply each side of Equation (16.30) by $\phi(L)$ so that

$$\phi(L)Y_t = \phi(L)[\beta_0 + \beta(L)X_t + u_t] = \alpha_0 + \delta(L)X_t + \tilde{u}_t, \tag{16.32}$$

where

$$\alpha_0 = \phi(1)\beta_0 \text{ and } \delta(L) = \phi(L)\beta(L), \text{ where } \phi(1) = \sum_{j=0}^{p} \phi_j. \tag{16.33}$$

The model in Equation (16.32) is the ADL(p, q) model including the contemporaneous value
of $X$, where $p$ is the number of lags of $Y$ and $q$ is the number of lags of $X$.

To derive the quasi-differenced model, note that $\phi(L)\beta(L)X_t = \beta(L)\phi(L)X_t = \beta(L)\tilde{X}_t$,
where $\tilde{X}_t = \phi(L)X_t$. Thus rearranging Equation (16.32) yields

$$\tilde{Y}_t = \alpha_0 + \beta(L)\tilde{X}_t + \tilde{u}_t, \tag{16.34}$$

where $\tilde{Y}_t$ is the quasi-difference of $Y_t$; that is, $\tilde{Y}_t = \phi(L)Y_t$.
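
As a small illustration of the quasi-difference, the following Python sketch applies a lag polynomial $\phi(L)$ with $\phi_0 = 1$ to a series. The function name and the numbers are hypothetical; the sign convention follows this appendix, so an AR(1) coefficient of 0.5 in the notation of Equation (16.19) enters as $\phi_1 = -0.5$.

```python
def quasi_difference(series, phi):
    """Apply phi(L) = sum_j phi[j] L^j to a series, with phi[0] equal to 1.

    Returns values for t = p, ..., T-1 (0-based), where p = len(phi) - 1,
    because the first p observations lack the required lags.
    """
    p = len(phi) - 1
    return [
        sum(phi[j] * series[t - j] for j in range(p + 1))
        for t in range(p, len(series))
    ]


# Illustrative values only: phi(L) = 1 - 0.5L applied to a short series.
x = [1.0, 2.0, 3.0, 4.0]
x_tilde = quasi_difference(x, [1.0, -0.5])
print(x_tilde)  # [1.5, 2.0, 2.5]
```

The same function applied to $Y_t$ gives $\tilde{Y}_t$, so that Equation (16.34) can be estimated by regressing one quasi-differenced series on the other once $\phi(L)$ is known or estimated.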


The Inverse of a Lag Polynomial


Let $a(x) = \sum_{j=0}^{p} a_j x^j$ denote a polynomial of order $p$. The inverse of $a(x)$, say $b(x)$, is a
function that satisfies $b(x)a(x) = 1$. If the roots of the polynomial $a(x)$ are greater than 1 in
absolute value, then $b(x)$ can be written as a polynomial in nonnegative powers of $x$:
$b(x) = \sum_{j=0}^{\infty} b_j x^j$. Because $b(x)$ is the inverse of $a(x)$, it is denoted as $a(x)^{-1}$ or as $1/a(x)$.

The inverse of a lag polynomial $a(L)$ is defined analogously: $a(L)^{-1} = 1/a(L) =
b(L) = \sum_{j=0}^{\infty} b_j L^j$, where $b(L)a(L) = 1$. For example, if $a(L) = (1 - \phi L)$, with $|\phi| < 1$, you
can verify that $a(L)^{-1} = 1 + \phi L + \phi^2 L^2 + \phi^3 L^3 + \cdots = \sum_{j=0}^{\infty} \phi^j L^j$. (See Exercise 16.11.)
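
The definition $b(L)a(L) = 1$ can be checked numerically. The sketch below, with hypothetical function names and an illustrative value of $\phi$, computes the first few coefficients of the inverse by matching powers of $L$ and then multiplies the truncated inverse back by $a(L)$. It is a numerical check under these assumptions, not a proof.

```python
def inverse_lag_polynomial(a, n_terms):
    """First n_terms coefficients b_0, b_1, ... of b(L) = a(L)^{-1}.

    Matching powers of L in b(L)a(L) = 1 gives b_0 = 1/a_0 and, for j >= 1,
    b_j = -(a_1 b_{j-1} + ... + a_p b_{j-p}) / a_0.
    """
    b = []
    for j in range(n_terms):
        target = 1.0 if j == 0 else 0.0
        acc = sum(a[i] * b[j - i] for i in range(1, min(j, len(a) - 1) + 1))
        b.append((target - acc) / a[0])
    return b


def multiply(a, b):
    """Coefficients of the product of two lag polynomials."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


phi = 0.5                 # illustrative value with |phi| < 1
a = [1.0, -phi]           # a(L) = 1 - phi L
b = inverse_lag_polynomial(a, 6)
product = multiply(b, a)
print(b)        # coefficients 1, phi, phi^2, ...
print(product)  # 1, then zeros, then a truncation remainder of -phi^6
```

The only nonzero term beyond the leading 1 is the truncation remainder, which shrinks toward 0 as more terms are kept when $|\phi| < 1$ and grows without bound when $|\phi| > 1$.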


The OLS and GLS Estimators 


The OLS estimator of the ADL coefficients is obtained by OLS estimation of Equation
(16.32). The original distributed lag coefficients are $\beta(L)$, which, in terms of the estimated
coefficients, are $\beta(L) = \phi(L)^{-1}\delta(L)$; that is, the coefficients in $\beta(L)$ satisfy the restrictions
implied by $\phi(L)\beta(L) = \delta(L)$. Thus the estimator of the dynamic multipliers based on the OLS
estimators of the coefficients of the ADL model, $\hat{\delta}(L)$ and $\hat{\phi}(L)$, is

$$\hat{\beta}^{ADL}(L) = \hat{\phi}(L)^{-1}\hat{\delta}(L). \tag{16.35}$$

The expressions for the coefficients in Equation (16.29) in the text are obtained as a special
case of Equation (16.35) when $p = 1$ and $q = 2$.
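
The relation $\phi(L)\beta(L) = \delta(L)$ can be turned into a simple recursion for the dynamic multipliers. The following Python sketch uses a hypothetical ADL(1,1) with made-up coefficients; the function name is an assumption, and `beta[h]` in the code is the effect after $h$ periods (the text's $\beta_{h+1}$).

```python
def dynamic_multipliers(phi, delta, horizon):
    """Coefficients of beta(L) = phi(L)^{-1} delta(L) up to the given horizon.

    phi   : [1, phi_1, ..., phi_p], coefficients of phi(L) with phi_0 = 1
    delta : [delta_0, ..., delta_q], coefficients of delta(L)
    Matching powers of L in phi(L) beta(L) = delta(L) gives
    beta_h = delta_h - (phi_1 beta_{h-1} + ... + phi_p beta_{h-p}).
    """
    beta = []
    for h in range(horizon + 1):
        d_h = delta[h] if h < len(delta) else 0.0
        acc = sum(phi[i] * beta[h - i] for i in range(1, min(h, len(phi) - 1) + 1))
        beta.append(d_h - acc)
    return beta


# Hypothetical ADL(1,1): Y_t = 0.5 Y_{t-1} + 2.0 X_t + 1.0 X_{t-1} + error,
# so phi(L) = 1 - 0.5L and delta(L) = 2.0 + 1.0L in this appendix's notation.
phi = [1.0, -0.5]
delta = [2.0, 1.0]
multipliers = dynamic_multipliers(phi, delta, 4)
cumulative = [sum(multipliers[: h + 1]) for h in range(5)]
long_run = sum(delta) / sum(phi)  # delta(1) / phi(1)
print(multipliers)  # [2.0, 2.0, 1.0, 0.5, 0.25]
print(cumulative)   # [2.0, 4.0, 5.0, 5.5, 5.75]
print(long_run)     # 6.0
```

In this example the cumulative multipliers approach the long-run value $\delta(1)/\phi(1)$ geometrically, because the single autoregressive root lies outside the unit circle.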

The feasible GLS estimator is computed by obtaining a preliminary estimator of $\phi(L)$,
computing estimated quasi-differences, estimating $\beta(L)$ in Equation (16.34) using these
estimated quasi-differences, and (if desired) iterating until convergence. The iterated feasible GLS
estimator is the nonlinear least squares estimator of the ADL model in Equation (16.32),
subject to the nonlinear restrictions on the parameters contained in Equation (16.33).


Conditions for estimation of the ADL coefficients. The discussion in Section 16.5 of the
conditions for consistent estimation of the ADL coefficients in the AR(1) case extends to the general
model with AR(p) errors. The conditional mean 0 assumption for Equation (16.34) is that

$$E(\tilde{u}_t \mid \tilde{X}_t, \tilde{X}_{t-1}, \ldots) = 0. \tag{16.36}$$

Because $\tilde{u}_t = \phi(L)u_t$ and $\tilde{X}_t = \phi(L)X_t$, this condition is equivalent to

$$E(u_t \mid X_t, X_{t-1}, \ldots) + \phi_1 E(u_{t-1} \mid X_t, X_{t-1}, \ldots) + \cdots + \phi_p E(u_{t-p} \mid X_t, X_{t-1}, \ldots) = 0. \tag{16.37}$$

For Equation (16.37) to hold for general values of $\phi_1, \ldots, \phi_p$, it must be the case that each of
the conditional expectations in Equation (16.37) is 0; equivalently, it must be the case that

$$E(u_t \mid X_{t+p}, X_{t+p-1}, X_{t+p-2}, \ldots) = 0. \tag{16.38}$$

This condition is not implied by $X_t$ being (past and present) exogenous, but it is implied
by $X_t$ being strictly exogenous. In fact, in the limit when $p$ is infinite (so that the error term
in the distributed lag model follows an infinite-order autoregression), the condition in
Equation (16.38) becomes the condition in Key Concept 16.1 for strict exogeneity.


CHAPTER 17 Additional Topics in Time Series Regression


This chapter takes up some further topics in time series regression, starting with
forecasting. Chapter 15 considered forecasting a single variable. In practice,
however, you might want to forecast two or more variables, such as the growth rate of 
gross domestic product (GDP) and the rate of inflation. Section 17.1 introduces a 
model for forecasting multiple variables, vector autoregressions (VARs), in which 
lagged values of two or more variables are used to forecast future values of those vari- 
ables. Chapter 15 focused on making forecasts one period (e.g., one quarter) into the 
future, but making forecasts two, three, or more periods into the future is important as 
well. Methods for making multi-period forecasts are discussed in Section 17.2. 

Sections 17.3 and 17.4 return to the topic of Section 15.6, stochastic trends. 
Section 17.3 introduces additional models of stochastic trends. Section 17.4 
introduces the concept of cointegration, which arises when two variables share a 
common stochastic trend—that is, when each variable contains a stochastic trend 
but a weighted difference of the two variables does not. 

In some time series data, especially financial data, the variance changes over time: 
Sometimes the series exhibits high volatility, while at other times the volatility is low, 
so the data exhibit clusters of volatility. Section 17.5 discusses volatility clustering and 
introduces models in which the variance of the forecast error changes over time—that 
is, models in which the forecast error is conditionally heteroskedastic. Models of 
conditional heteroskedasticity have several applications. One application is computing 
forecast intervals, where the width of the interval changes over time to reflect periods 
of high or low uncertainty. Another application is forecasting the uncertainty of returns 
on an asset, such as a stock, which in turn can be useful in assessing the risk of owning 
that asset or forecasting the price of derivative assets that depend on this risk. 

Section 17.6 takes up the challenge of forecasting when there are many predictors, as 
is the case for macroeconomic data in developed economies. This section draws on mate- 
rial introduced in Chapter 14 and focuses on one commonly used method for forecasting 
with large data sets, which uses principal components analysis to reduce the information in 
a large time series data set to a small number of time series. The framework for doing so is 
the dynamic factor model, which also can be used for purposes other than forecasting. 


17.1 Vector Autoregressions

Chapter 15 focused on forecasting the growth rate of GDP, but in reality, economic
forecasters are in the business of forecasting other key macroeconomic variables as
well, such as the rate of inflation, the unemployment rate, and interest rates. One
approach is to develop a separate forecasting model for each variable, using the
methods of Section 15.4. Another approach is to develop a single model that can
forecast all the variables, which can help to make the forecasts mutually consistent.
One way to forecast several variables with a single model is to use a vector
autoregression (VAR). A VAR extends the univariate autoregression to multiple time
series variables; that is, it extends the univariate autoregression to a “vector” of time
series variables.


The VAR Model


A vector autoregression (VAR) with two time series variables, $Y_t$ and $X_t$, consists of
two equations: In one, the dependent variable is $Y_t$; in the other, the dependent
variable is $X_t$. The regressors in both equations are lagged values of both variables. More
generally, a VAR with $k$ time series variables consists of $k$ equations, one for each of
the variables, where the regressors in all equations are lagged values of all the
variables. The coefficients of the VAR are estimated by estimating each of the equations
by ordinary least squares (OLS).
VARs are summarized in Key Concept 17.1.


KEY CONCEPT 17.1 Vector Autoregressions

A vector autoregression (VAR) is a set of $k$ time series regressions, in which the
regressors are lagged values of all $k$ series. A VAR extends the univariate
autoregression to a list, or “vector,” of time series variables. When the number of lags
in each of the equations is the same and is equal to $p$, the system of equations is
called a VAR(p).

In the case of two time series variables, $Y_t$ and $X_t$, the VAR(p) consists of the
two equations,

$$Y_t = \beta_{10} + \beta_{11}Y_{t-1} + \cdots + \beta_{1p}Y_{t-p} + \gamma_{11}X_{t-1} + \cdots + \gamma_{1p}X_{t-p} + u_{1t} \tag{17.1}$$

$$X_t = \beta_{20} + \beta_{21}Y_{t-1} + \cdots + \beta_{2p}Y_{t-p} + \gamma_{21}X_{t-1} + \cdots + \gamma_{2p}X_{t-p} + u_{2t}, \tag{17.2}$$

where the $\beta$'s and the $\gamma$'s are unknown coefficients and $u_{1t}$ and $u_{2t}$ are error
terms.

The VAR assumptions are the time series regression assumptions of Key
Concept 15.6 applied to each equation. The coefficients of a VAR are estimated by
estimating each equation by OLS.
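
The equation-by-equation OLS estimation described above can be sketched in a few lines of Python with NumPy. Everything in this example is an assumption made for illustration: the function name `fit_var`, the simulated bivariate VAR(1), and its coefficient values. None of it uses the data analyzed in the text.

```python
import numpy as np


def fit_var(data, p):
    """Estimate a VAR(p) by OLS, one equation per column of `data`.

    data : T x k array of observations.
    Returns a k x (1 + k*p) array; row i holds the intercept of equation i
    followed by the coefficients on lag 1 of all k variables, then lag 2, etc.
    """
    data = np.asarray(data, dtype=float)
    T, k = data.shape
    lags = [data[p - j:T - j] for j in range(1, p + 1)]   # lag j of every variable
    X = np.hstack([np.ones((T - p, 1))] + lags)           # common regressor matrix
    Y = data[p:]                                          # one dependent variable per column
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)          # OLS for all equations at once
    return coef.T


# Simulate a hypothetical stationary bivariate VAR(1) with known coefficients.
rng = np.random.default_rng(0)
c = np.array([1.0, 0.5])                  # intercepts
A = np.array([[0.5, 0.1], [0.2, 0.3]])    # lag-1 coefficient matrix
T = 5000
data = np.zeros((T, 2))
for t in range(1, T):
    data[t] = c + A @ data[t - 1] + rng.standard_normal(2)

coef = fit_var(data, p=1)
print(np.round(coef, 2))  # first column near c, remaining columns near A
```

Because every equation has the same regressors, solving the least squares problem once with a matrix of dependent variables gives the same coefficients as running OLS separately on each equation.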


Inference in VARs. Under the VAR assumptions, the OLS estimators are consistent
and have a joint normal distribution in large samples. Accordingly, statistical
inference proceeds in the usual manner; for example, 95% confidence intervals on
coefficients can be constructed as the estimated coefficient ± 1.96 standard errors.

One new aspect of hypothesis testing arises in VARs because a VAR with $k$
variables is a collection, or system, of $k$ equations. Thus it is possible to test joint
hypotheses that involve restrictions across multiple equations.

For example, in the two-variable VAR(p) in Equations (17.1) and (17.2), you
could ask whether the correct lag length is $p$ or $p - 1$; that is, you could ask whether
the coefficients on $Y_{t-p}$ and $X_{t-p}$ are 0 in these two equations. The null hypothesis that
these coefficients are 0 is

$$H_0\colon \beta_{1p} = 0,\ \beta_{2p} = 0,\ \gamma_{1p} = 0,\ \text{and } \gamma_{2p} = 0. \tag{17.3}$$

The alternative hypothesis is that at least one of these four coefficients is nonzero.
Thus the null hypothesis involves coefficients from both of the equations, two from
each equation.

Because the estimated coefficients have a jointly normal distribution in large
samples, it is possible to test restrictions on these coefficients by computing an
F-statistic. The precise formula for this statistic is complicated because the notation
must handle multiple equations, so we omit it. In practice, most modern software
packages have automated procedures for testing hypotheses on coefficients in
systems of multiple equations.


How many variables should be included in a VAR? The number of coefficients in 
each equation of a VAR is proportional to the number of variables in the VAR. For 
example, a VAR with 5 variables and 4 lags will have 21 coefficients (4 lags each of 
5 variables, plus the intercept) in each of the 5 equations, for a total of 105 coefficients! 
As discussed in Section 14.2, estimating all these coefficients by OLS increases the 
amount of estimation error entering a forecast, which can result in deterioration of 
the accuracy of the forecast as measured by the mean squared forecast error 
(MSFE). If the VAR coefficients are estimated by OLS, the number of coefficients 
therefore should be small relative to the sample size, so the number of VAR vari- 
ables should be few. 

In this section, we consider small VARs with coefficients estimated by OLS. Because 
a small VAR has only a handful of variables, those variables should be chosen with care. 
One guideline is to make sure the variables are plausibly related to each other so that 
they will be useful for forecasting one another. For example, we know from a combina- 
tion of empirical evidence (such as that discussed in Chapter 15) and economic theory 
that the growth rate of GDP, the term spread, and the rate of inflation are related to one 
another, suggesting that these variables could help forecast one another in a VAR. 
Including an unrelated variable in a VAR, however, introduces estimation error without 
adding predictive content, thereby reducing forecast accuracy. 

An alternative approach is to use many variables but to use methods other than
OLS. We take up forecasting with many predictors in Section 17.6.

Determining lag lengths in VARs. Lag lengths in a VAR can be determined using
either F-tests or information criteria.

The information criterion for a system of equations extends the single-equation
information criterion in Section 15.5. To define this information criterion, we need to
adopt matrix notation (reviewed in Appendix 19.1). Let $\Sigma_u$ be the $k \times k$ covariance
matrix of the VAR errors, and let $\hat{\Sigma}_u$ be the estimate of the covariance matrix, where
the $(i, j)$ element of $\hat{\Sigma}_u$ is $\frac{1}{T}\sum_{t=1}^{T}\hat{u}_{it}\hat{u}_{jt}$, where $\hat{u}_{it}$ is the OLS residual from the $i$th
equation and $\hat{u}_{jt}$ is the OLS residual from the $j$th equation. The Bayes information
criterion (BIC) for the VAR is

$$\mathrm{BIC}(p) = \ln[\det(\hat{\Sigma}_u)] + k(kp + 1)\frac{\ln(T)}{T}, \tag{17.4}$$

where $\det(\hat{\Sigma}_u)$ is the determinant of the matrix $\hat{\Sigma}_u$. The Akaike information
criterion (AIC) is computed using Equation (17.4), modified by replacing the term
$\ln(T)$ with 2.

The expression for the BIC for the $k$ equations in the VAR in Equation (17.4)
extends the expression for a single equation given in Section 15.5. When there is a
single equation, the first term simplifies to $\ln[SSR(p)/T]$. The second term in
Equation (17.4) is the penalty for adding additional regressors; $k(kp + 1)$ is the total
number of regression coefficients in the VAR. (There are $k$ equations, each of which has
an intercept and $p$ lags of each of the $k$ time series variables.)

Lag length estimation in a VAR using the BIC proceeds analogously to the
single-equation case: Among a set of candidate values of $p$, the estimated lag length $\hat{p}$ is the
value of $p$ that minimizes BIC(p).
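
Equation (17.4) translates directly into code once the OLS residuals are available. The sketch below assumes a $T \times k$ array of residuals from a VAR(p) that has already been estimated; the function names and the tiny residual array are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np


def var_bic(residuals, p):
    """BIC(p) = ln det(Sigma_hat) + k(kp + 1) ln(T)/T for T x k residuals."""
    u = np.asarray(residuals, dtype=float)
    T, k = u.shape
    sigma_hat = u.T @ u / T                    # (i, j) element: (1/T) sum_t u_it u_jt
    _, logdet = np.linalg.slogdet(sigma_hat)   # log of the determinant
    return logdet + k * (k * p + 1) * np.log(T) / T


def var_aic(residuals, p):
    """AIC(p): the same expression with ln(T) replaced by 2."""
    u = np.asarray(residuals, dtype=float)
    T, k = u.shape
    sigma_hat = u.T @ u / T
    _, logdet = np.linalg.slogdet(sigma_hat)
    return logdet + k * (k * p + 1) * 2.0 / T


# Made-up residuals with T = 4 and k = 2, so Sigma_hat = 0.5 * identity.
resid = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
bic_1 = var_bic(resid, p=1)   # ln(0.25) + 6 ln(4)/4 = ln(2)
aic_1 = var_aic(resid, p=1)   # ln(0.25) + 6 * 2/4
print(bic_1, aic_1)
```

To choose a lag length, one would compute the residuals for each candidate $p$, evaluate `var_bic` for each, and pick the $p$ with the smallest value.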


Using VARs for causal analysis. The discussion so far has focused on using VARs for 
forecasting. Another use of VAR models is for analyzing causal relationships among 
economic time series variables; indeed, it was for this purpose that VARs were first 
introduced to economics by the econometrician and macroeconomist Christopher 
Sims (1980). (See the box “Nobel Laureates in Time Series Econometrics.”) The use 
of VARs for causal inference is known as structural VAR modeling — structural 
because in this application VARs are used to model the underlying structure of the 
economy. Structural VAR analysis uses the techniques introduced in this section in 
the context of forecasting plus some additional tools. The biggest conceptual differ- 
ence between using VARs for forecasting and using them for structural modeling, 
however, is that structural modeling requires very specific assumptions, derived from 
economic theory and institutional knowledge, of what is exogenous and what is not. 
The discussion of structural VARs is best undertaken in the context of estimation of 
systems of simultaneous equations, which goes beyond the scope of this book. For an 
introduction to using VARs for forecasting and policy analysis, see Stock and Watson 
(2001). For a graduate textbook treatment of structural VAR modeling, see Kilian
and Lütkepohl (2017).

A VAR Model of the Growth Rate of GDP and the Term Spread


As an illustration, consider a two-variable VAR for the growth rate of GDP, $GDPGR_t$,
and the term spread, $TSpread_t$. The VAR for $GDPGR_t$ and $TSpread_t$ consists of two
equations: one in which $GDPGR_t$ is the dependent variable and one in which
$TSpread_t$ is the dependent variable. The regressors in both equations are lagged
values of $GDPGR_t$ and $TSpread_t$. Because of the apparent break in the relation in the
early 1980s found in Section 15.7 using the Quandt likelihood ratio (QLR) test, the
VAR is estimated using data from 1981:Q1 to 2017:Q3.
The first equation of the VAR is the GDP growth rate equation:

$$
\begin{aligned}
\widehat{GDPGR}_t = {} & \underset{(0.50)}{0.54} + \underset{(0.11)}{0.29}\,GDPGR_{t-1} + \underset{(0.08)}{0.20}\,GDPGR_{t-2} \\
& - \underset{(0.35)}{0.86}\,TSpread_{t-1} + \underset{(0.39)}{1.18}\,TSpread_{t-2}.
\end{aligned}
\tag{17.5}
$$

The adjusted $R^2$ is $\bar{R}^2 = 0.27$.

The second equation of the VAR is the term spread equation, in which the
regressors are the same as in the GDPGR equation but the dependent variable is the
term spread:

$$
\begin{aligned}
\widehat{TSpread}_t = {} & \underset{(0.12)}{0.44} + \underset{(0.02)}{0.01}\,GDPGR_{t-1} - \underset{(0.03)}{0.05}\,GDPGR_{t-2} \\
& + \underset{(0.10)}{1.06}\,TSpread_{t-1} - \underset{(0.11)}{0.22}\,TSpread_{t-2}.
\end{aligned}
\tag{17.6}
$$

The adjusted $R^2$ is $\bar{R}^2 = 0.82$.

Equations (17.5) and (17.6), taken together, are a VAR(2) model of the growth
rate of GDP, $GDPGR_t$, and the term spread, $TSpread_t$.

These VAR equations can be used to perform tests of predictability. The F-statistic
testing the null hypothesis that the coefficients on $TSpread_{t-1}$ and $TSpread_{t-2}$ are 0 in the
GDP growth rate equation [Equation (17.5)] is 5.60, which has a p-value less than 0.01.
Thus the null hypothesis is rejected, so we can conclude that the term spread is a useful
predictor of the growth rate of GDP, given lags in the growth rate of GDP. The F-statistic
testing the hypothesis that the coefficients on the two lags of $GDPGR_t$ are zero in the
term spread equation [Equation (17.6)] is 3.22, which has a p-value of 0.04. Thus the
growth rate of GDP helps predict the term spread at the 5% significance level.

Forecasts of the growth rate of GDP and the term spread one period ahead are
obtained exactly as discussed in Section 15.4. The forecast of the growth rate of GDP for
2017:Q4, based on Equation (17.5) and data through 2017:Q3, is
$\widehat{GDPGR}_{2017:Q4|2017:Q3} = 2.8\%$. A similar calculation
using Equation (17.6) gives a forecast of the term spread for 2017:Q4, based on data
through 2017:Q3, of $\widehat{TSpread}_{2017:Q4|2017:Q3} = 1.3$ percentage points. The actual values for
2017:Q4 are $GDPGR_{2017:Q4} = 2.5\%$ and $TSpread_{2017:Q4} = 1.2$ percentage points.


17.2 Multi-period Forecasts


The discussion of forecasting so far has focused on making forecasts one period in 
advance. Often, however, forecasters are called upon to make forecasts further into 
the future. This section describes two methods for making multi-period forecasts, 
which are also called multi-step forecasts. The first method is to construct iterated 
forecasts, in which a one-period ahead model is iterated forward one period at a time 
in a way that is made precise in this section. The second method is to make direct 
forecasts by using a regression in which the dependent variable is the multi-period 
variable that one wants to forecast. For reasons discussed at the end of this section, 
in most applications the iterated method is recommended over the direct method. 


Iterated Multi-period Forecasts 


The essential idea of an iterated forecast is that a forecasting model is used to make 
a forecast one period ahead, for period T + 1, using data through period T. The 
model then is used to make a forecast for date T + 2, given the data through date T, 
where the forecasted value for date T + 1 is treated as data for the purpose of making
the forecast for period T + 2. Thus the one-period ahead forecast (which is also
referred to as a one-step ahead forecast) is used as an intermediate step to make the 
two-period ahead forecast. This process repeats, or iterates, until the forecast is made 
for the desired forecast horizon h. 


The iterated AR forecast method: AR(1). An iterated AR(1) forecast uses an AR(1)
for the one-period ahead model. For example, consider the first-order autoregression
for GDPGR [Equation (15.9)]:

$$\widehat{GDPGR}_t = \underset{(0.32)}{1.95} + \underset{(0.07)}{0.34}\,GDPGR_{t-1}. \tag{17.7}$$

The first step in computing the two-quarter ahead forecast of $GDPGR_{2018:Q1}$ based
on Equation (17.7) and using data through 2017:Q3 is to compute the
one-quarter ahead forecast of $GDPGR_{2017:Q4}$ based on data through 2017:Q3:
$\widehat{GDPGR}_{2017:Q4|2017:Q3} = 1.95 + 0.34\,GDPGR_{2017:Q3} = 1.95 + 0.34 \times 3.11 = 3.0$.
The second step is to substitute this forecast into Equation (17.7), so that
$\widehat{GDPGR}_{2018:Q1|2017:Q3} = 1.95 + 0.34\,\widehat{GDPGR}_{2017:Q4|2017:Q3} = 1.95 + 0.34 \times 3.0 = 3.0$.
Thus, based on information through the third quarter of 2017, this forecast states
that the growth rate of GDP will be 3.0% in the first quarter of 2018.
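
The two substitution steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `iterated_ar1_forecast` is hypothetical, and the only inputs are the coefficients of Equation (17.7) and the 2017:Q3 growth rate of 3.11 quoted in the text. Unlike the hand calculation, the sketch carries the unrounded one-quarter ahead forecast into the second step; the rounded results agree.

```python
def iterated_ar1_forecast(intercept, slope, last_value, horizon):
    """Iterate an AR(1) forward, returning forecasts for horizons 1..horizon."""
    forecasts = []
    previous = last_value
    for _ in range(horizon):
        # Each forecast is treated as data when computing the next one.
        previous = intercept + slope * previous
        forecasts.append(previous)
    return forecasts


# Coefficients from Equation (17.7); 3.11 is GDPGR in 2017:Q3.
path = iterated_ar1_forecast(1.95, 0.34, 3.11, horizon=2)
print([round(f, 1) for f in path])  # forecasts for 2017:Q4 and 2018:Q1
```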


The iterated AR forecast method: AR(p). The iterated AR(1) strategy is extended to
an AR(p) by replacing $Y_{T+1}$ with its forecast, $\hat{Y}_{T+1|T}$, and then treating that forecast
as data for the AR(p) forecast of $Y_{T+2}$. For example, consider the iterated two-period
ahead forecast of the growth rate of GDP based on the AR(2) model from Section 15.3
[Equation (15.11)]:

$$\widehat{GDPGR}_t = \underset{(0.37)}{1.60} + \underset{(0.08)}{0.28}\,GDPGR_{t-1} + \underset{(0.08)}{0.18}\,GDPGR_{t-2}. \tag{17.8}$$

The forecast of $GDPGR_{2017:Q4}$ based on data through 2017:Q3 using this AR(2),
computed in Section 15.3, is $\widehat{GDPGR}_{2017:Q4|2017:Q3} = 3.0$. Thus the two-quarter
ahead iterated forecast based on the AR(2) is
$\widehat{GDPGR}_{2018:Q1|2017:Q3} = 1.60 + 0.28\,\widehat{GDPGR}_{2017:Q4|2017:Q3} + 0.18\,GDPGR_{2017:Q3} = 1.60 + 0.28 \times 3.0 + 0.18 \times 3.1 = 3.0$.
According to this iterated AR(2) forecast, based on data through the third
quarter of 2017, the growth rate of GDP is predicted to be 3.0% in the first quarter
of 2018.


Iterated multivariate forecasts using an iterated VAR. Iterated multivariate forecasts
can be computed using a VAR in much the same way as iterated univariate
forecasts are computed using an autoregression. The main new feature of an iterated
multivariate forecast is that the two-step ahead (period T + 2) forecast of one variable
depends on the forecasts of all variables in the VAR in period T + 1. For example,
to compute the forecast of the growth rate of GDP in period T + 2 using a VAR
with the variables $GDPGR_t$ and $TSpread_t$, one must forecast both $GDPGR_{T+1}$
and $TSpread_{T+1}$, using data through period T, as an intermediate step in forecasting
$GDPGR_{T+2}$. More generally, to compute multi-period iterated VAR forecasts h periods
ahead, it is necessary to compute forecasts of all variables for all intervening
periods between T and T + h.

As an example, we will compute the iterated VAR forecast of $GDPGR_{2018:Q1}$
based on data through 2017:Q3, using the VAR(2) for $GDPGR_t$ and $TSpread_t$ in
Section 17.1 [Equations (17.5) and (17.6)]. The first step is to compute the one-quarter
ahead forecasts $\widehat{GDPGR}_{2017:Q4|2017:Q3}$ and $\widehat{TSpread}_{2017:Q4|2017:Q3}$ from that
VAR. These one-period ahead forecasts were computed in Section 17.1 based on
Equations (17.5) and (17.6). The forecasts were $\widehat{GDPGR}_{2017:Q4|2017:Q3} = 2.8$ and
$\widehat{TSpread}_{2017:Q4|2017:Q3} = 1.3$. In the second step, these forecasts are substituted
into Equations (17.5) and (17.6) to produce the two-quarter ahead forecast:

$$\begin{aligned}
\widehat{GDPGR}_{2018:Q1|2017:Q3} &= 0.54 + 0.29\,\widehat{GDPGR}_{2017:Q4|2017:Q3} + 0.20\,GDPGR_{2017:Q3} \\
&\quad - 0.86\,\widehat{TSpread}_{2017:Q4|2017:Q3} + 1.28\,TSpread_{2017:Q3} \\
&= 0.54 + 0.29 \times 2.8 + 0.20 \times 3.1 - 0.86 \times 1.3 + 1.28 \times 1.2 = 2.4.
\end{aligned} \tag{17.9}$$

Thus the iterated VAR(2) forecast, based on data through the third quarter of 2017,
is that the growth rate of GDP will be 2.4% in the first quarter of 2018.
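
The second step of this calculation can be checked with a short Python sketch. It is an illustration only: the function name `gdp_two_step_forecast` is hypothetical, the coefficients are those printed in the two-quarter ahead calculation above, and the inputs are the rounded one-quarter ahead forecasts (2.8 and 1.3) and 2017:Q3 values (3.1 and 1.2) used there.

```python
def gdp_two_step_forecast(gdp_forecast_next, gdp_latest,
                          spread_forecast_next, spread_latest):
    """Two-quarter ahead GDP growth forecast from the GDP equation of the VAR(2).

    The period T+1 values of BOTH variables are forecasts, treated as data;
    the period T values are actual observations.
    """
    return (0.54
            + 0.29 * gdp_forecast_next      # forecast of GDPGR for T+1
            + 0.20 * gdp_latest             # observed GDPGR in T
            - 0.86 * spread_forecast_next   # forecast of TSpread for T+1
            + 1.28 * spread_latest)         # observed TSpread in T


forecast = gdp_two_step_forecast(2.8, 3.1, 1.3, 1.2)
print(round(forecast, 1))
```

The point of the sketch is that the GDP forecast for period T + 2 cannot be formed without first forecasting the term spread for period T + 1, even if only GDP growth is of interest.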
Iterated multi-period forecasts are summarized in Key Concept 17.2.

The iterated multi-period AR forecast is computed in steps: First compute the
one-period ahead forecast, and then use that to compute the two-period ahead
forecast, and so forth. The two- and three-period ahead iterated forecasts based
on an AR(p) are

$$\hat{Y}_{T+2|T} = \hat{\beta}_0 + \hat{\beta}_1 \hat{Y}_{T+1|T} + \hat{\beta}_2 Y_T + \hat{\beta}_3 Y_{T-1} + \cdots + \hat{\beta}_p Y_{T-p+2} \tag{17.10}$$

$$\hat{Y}_{T+3|T} = \hat{\beta}_0 + \hat{\beta}_1 \hat{Y}_{T+2|T} + \hat{\beta}_2 \hat{Y}_{T+1|T} + \hat{\beta}_3 Y_T + \cdots + \hat{\beta}_p Y_{T-p+3} \tag{17.11}$$

where the $\hat{\beta}$'s are the OLS estimates of the AR(p) coefficients. Continuing this
process (iterating) produces forecasts further into the future.

The iterated multi-period VAR forecast is also computed in steps: First compute
the one-period ahead forecast of all the variables in the VAR, then use those
forecasts to compute the two-period ahead forecasts, and continue this process
iteratively to the desired forecast horizon. The two-period ahead iterated forecast
of $Y_{T+2}$, based on the two-variable VAR(p) in Key Concept 17.1, is

$$\begin{aligned}
\hat{Y}_{T+2|T} &= \hat{\beta}_{10} + \hat{\beta}_{11} \hat{Y}_{T+1|T} + \hat{\beta}_{12} Y_T + \hat{\beta}_{13} Y_{T-1} + \cdots + \hat{\beta}_{1p} Y_{T-p+2} \\
&\quad + \hat{\gamma}_{11} \hat{X}_{T+1|T} + \hat{\gamma}_{12} X_T + \hat{\gamma}_{13} X_{T-1} + \cdots + \hat{\gamma}_{1p} X_{T-p+2},
\end{aligned} \tag{17.12}$$

where the coefficients in Equation (17.12) are the OLS estimates of the VAR coefficients.
Iterating produces forecasts further into the future.
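
The AR(p) recursion in this Key Concept generalizes to any horizon. The following Python sketch is one way to write it; the function name `iterated_ar_forecast` and its argument layout are assumptions made for illustration. Each new forecast is appended to the working series, so later steps automatically use earlier forecasts in place of unobserved values.

```python
def iterated_ar_forecast(coefs, history, horizon):
    """Iterated AR(p) forecasts for horizons 1..horizon.

    coefs   -- [b0, b1, ..., bp], with b1 multiplying the most recent value
    history -- observed series, most recent observation last (length >= p)
    """
    p = len(coefs) - 1
    values = list(history[-p:])          # only the last p values matter
    forecasts = []
    for _ in range(horizon):
        most_recent_first = values[::-1][:p]
        next_value = coefs[0] + sum(
            b * v for b, v in zip(coefs[1:], most_recent_first))
        forecasts.append(next_value)
        values.append(next_value)        # forecast is treated as data
    return forecasts


# AR(2) coefficients from Equation (17.8). The "history" here is GDPGR in
# 2017:Q3 (3.1) followed by the one-quarter ahead forecast (3.0) treated as
# data, so a single further step gives the two-quarter ahead forecast.
two_step = iterated_ar_forecast([1.60, 0.28, 0.18], [3.1, 3.0], horizon=1)[0]
print(round(two_step, 1))
```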


Direct Multi-period Forecasts 


Direct multi-period forecasts are computed without iterating by using a single regres- 
sion, in which the dependent variable is the multi-period ahead variable to be fore- 
casted and the regressors are the predictor variables. Forecasts computed this way 
are called direct forecasts because the regression coefficients can be used directly to 
make the multi-period forecast. 


The direct multi-period forecasting method. Suppose that you want to make a forecast
of $Y_{T+2}$ using data through time T. The direct multivariate method takes the
ADL model as its starting point but lags the predictor variables by an additional time
period. For example, if two lags of the predictors are used, then the dependent variable
is $Y_t$ and the regressors are $Y_{t-2}$, $Y_{t-3}$, $X_{t-2}$, and $X_{t-3}$. The coefficients from this
regression can be used directly to compute the forecast of $Y_{T+2}$ using data on $Y_T$, $Y_{T-1}$,
$X_T$, and $X_{T-1}$, without the need for any iteration. More generally, in a direct h-period
ahead forecasting regression, all predictors are lagged h periods to produce the
h-period ahead forecast.

For example, the forecast of $GDPGR_t$ two quarters ahead using two lags each of
$GDPGR_{t-2}$ and $TSpread_{t-2}$ is computed by first estimating the regression:

$$\widehat{GDPGR}_t = \underset{(0.63)}{0.56} + \underset{(0.07)}{0.31}\,GDPGR_{t-2} + \underset{(0.09)}{0.04}\,GDPGR_{t-3} + \underset{(0.46)}{0.56}\,TSpread_{t-2} + \underset{(0.45)}{0.04}\,TSpread_{t-3}. \tag{17.13}$$

The two-quarter ahead forecast of the growth rate of GDP in 2018:Q1 based on data
through 2017:Q3 is computed by substituting the values of $GDPGR_{2017:Q3}$,
$GDPGR_{2017:Q2}$, $TSpread_{2017:Q3}$, and $TSpread_{2017:Q2}$ into Equation (17.13); this yields

$$\begin{aligned}
\widehat{GDPGR}_{2018:Q1|2017:Q3} &= 0.56 + 0.31\,GDPGR_{2017:Q3} + 0.04\,GDPGR_{2017:Q2} \\
&\quad + 0.56\,TSpread_{2017:Q3} + 0.04\,TSpread_{2017:Q2} = 2.4.
\end{aligned} \tag{17.14}$$

The three-quarter ahead direct forecast of $GDPGR_{T+3}$ is computed by lagging all the
regressors in Equation (17.13) by one additional quarter, estimating that regression, and
then computing the forecast. The h-quarter ahead direct forecast of $GDPGR_{T+h}$ is
computed by using $GDPGR_t$ as the dependent variable and the regressors $GDPGR_{t-h}$
and $TSpread_{t-h}$, plus additional lags of $GDPGR_{t-h}$ and $TSpread_{t-h}$, as desired.
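
The alignment of the dependent variable with regressors lagged h periods is the only bookkeeping the direct method requires. The Python sketch below shows that alignment and the resulting forecast; the function names `direct_design` and `direct_forecast` and the toy series are hypothetical, and the estimation step itself (OLS on the rows) is left out.

```python
def direct_design(y, x, horizon, lags):
    """Pair each Y_t with [Y_{t-h}, ..., Y_{t-h-p+1}, X_{t-h}, ..., X_{t-h-p+1}]."""
    rows = []
    first_t = horizon + lags - 1          # earliest t for which every lag exists
    for t in range(first_t, len(y)):
        y_lags = [y[t - horizon - j] for j in range(lags)]
        x_lags = [x[t - horizon - j] for j in range(lags)]
        rows.append((y[t], y_lags + x_lags))
    return rows


def direct_forecast(coefs, y, x, lags):
    """Apply estimated coefficients to the latest `lags` observations.

    coefs -- [intercept, Y-lag coefficients..., X-lag coefficients...]
    No iteration is needed: the horizon is built into the regression.
    """
    latest = [y[-1 - j] for j in range(lags)] + [x[-1 - j] for j in range(lags)]
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], latest))


# Toy data (not the GDP series) to show the alignment for h = 2, p = 2.
y = [1, 2, 3, 4, 5]
x = [10, 20, 30, 40, 50]
for dependent, regressors in direct_design(y, x, horizon=2, lags=2):
    print(dependent, regressors)
```

With horizon 2 and two lags, the first usable observation is the fourth one, because its regressors reach back three periods.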


Standard errors in direct multi-period regressions. Because the dependent variable 
in a multi-period regression occurs two or more periods into the future, the error 
term in a multi-period regression is serially correlated. To see this, consider the two- 
period ahead forecast of the GDP growth rate, and suppose that a surprise jump in 
oil prices occurs in the next quarter. Today’s two-period ahead forecast of the growth 
rate of GDP will be too high because it does not incorporate this unexpected nega- 
tive event. Because the oil price rise was also unknown in the previous quarter, the 
two-period ahead forecast made last quarter will also be too high. Thus the surprise 
oil price jump next quarter means that both last quarter’s and this quarter’s two- 
period ahead forecasts are too high. Because of such intervening events, the error 
term in a multi-period regression is serially correlated. 

As discussed in Section 16.4, if the error term is serially correlated, the usual OLS 
standard errors are incorrect, or, more precisely, they are not a reliable basis for infer- 
ence. Therefore, heteroskedasticity- and autocorrelation-consistent (HAC) standard 
errors must be used with direct multi-period regressions. The standard errors reported in 
Equation (17.13) for direct multi-period regressions therefore are Newey-West HAC
standard errors, where the truncation parameter m is set according to Equation (16.17); 
for these data (for which T = 147), Equation (16.17) yields m = 4. For longer forecast 
horizons, the amount of overlap—and thus the degree of serial correlation in the 
error—increases: In general, the first h — 1 autocorrelation coefficients of the errors 
in an h-period ahead regression are nonzero. Thus larger values of m than indicated by 
Equation (16.17) are appropriate for multi-period regressions with long forecast 
horizons. 
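
A small Python sketch can make both points concrete. It rests on two assumptions that are not stated in this section: that Equation (16.17) is the rule of thumb $m = 0.75\,T^{1/3}$ rounded to the nearest integer, and, for the autocorrelation calculation, that the data follow an AR(1) with serially uncorrelated errors, so that the h-period ahead error is a weighted sum of the h intervening shocks. The function names are hypothetical.

```python
def truncation_parameter(sample_size):
    """Rule-of-thumb HAC truncation parameter (assumed form of Equation (16.17))."""
    return round(0.75 * sample_size ** (1 / 3))


def multi_period_error_autocorrelations(phi, horizon, max_lag):
    """Autocorrelations of h-period ahead errors when Y_t is an AR(1).

    Under that assumption the error is u_{t+h} + phi*u_{t+h-1} + ...
    + phi**(h-1)*u_{t+1}, so errors k periods apart share h - k shocks.
    """
    weights = [phi ** i for i in range(horizon)]
    variance = sum(w * w for w in weights)
    autocorrelations = []
    for k in range(1, max_lag + 1):
        # Empty sum (zero) once k >= horizon: no shared shocks remain.
        covariance = sum(weights[i] * weights[i + k] for i in range(horizon - k))
        autocorrelations.append(covariance / variance)
    return autocorrelations


print(truncation_parameter(147))
print(multi_period_error_autocorrelations(0.34, horizon=2, max_lag=3))
print(multi_period_error_autocorrelations(0.34, horizon=4, max_lag=5))
```

In this stylized setting exactly the first h - 1 autocorrelations are nonzero, which is why a truncation parameter chosen only from the sample size can be too small at long horizons.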

Direct multi-period forecasts are summarized in Key Concept 17.3.

The direct multi-period forecast h periods into the future based on p lags each of
$Y_t$ and an additional predictor $X_t$ is computed by first estimating the regression

$$Y_t = \delta_0 + \delta_1 Y_{t-h} + \cdots + \delta_p Y_{t-p-h+1} + \delta_{p+1} X_{t-h} + \cdots + \delta_{2p} X_{t-p-h+1} + u_t \tag{17.15}$$

and then using the estimated coefficients directly to make the forecast of $Y_{T+h}$
using data through period T.


Which Method Should You Use? 


In most applications, the iterated method is the recommended procedure for multi- 
period forecasting for two reasons. First, from a theoretical perspective, if the under- 
lying one-period ahead model (the AR or VAR that is used to compute the iterated 
forecast) is specified correctly, then the coefficients are estimated more efficiently if 
they are estimated by a one-period ahead regression (and then iterated) than by a 
multi-period ahead regression. Second, from a practical perspective, forecasters are 
usually interested in forecasts not just at a single horizon but at multiple horizons. 
Because they are produced using the same model, iterated forecasts tend to have 
time paths that are less erratic across horizons than do direct forecasts. Because a 
different model is used at every horizon for direct forecasts, sampling error in the 
estimated coefficients can add random fluctuations to the time paths of a sequence 
of direct multi-period forecasts. 

Under some circumstances, however, direct forecasts are preferable to iterated 
forecasts. One such circumstance is when you have reason to believe that the one- 
period ahead model (the AR or VAR) is not specified correctly. For example, you 
might believe that the equation for the variable you are trying to forecast in a VAR
is specified correctly but that one or more of the other equations in the VAR are 
specified incorrectly, perhaps because of neglected nonlinear terms. If the one-step 
ahead model is specified incorrectly, then, in general, the iterated multi-period fore- 
cast will be biased, and the MSFE of the iterated forecast can exceed the MSFE of 
the direct forecast, even though the direct forecast has a larger variance. 


17.3 Orders of Integration and the Nonnormality of Unit Root Test Statistics


This section extends the treatment of stochastic trends in Section 15.6 by addressing
two further topics. First, the trends of some time series are not well described by the
random walk model, so we introduce an extension of that model and discuss its
implications for regression modeling of such series. Next we discuss the reason for
the nonnormal distribution of the ADF test for a unit root.


Other Models of Trends and Orders of Integration 


Recall that the random walk model for a trend, introduced in Section 15.6, specifies
that the trend at date t equals the trend at date t - 1 plus a random error term. If $Y_t$
follows a random walk with drift $\beta_0$, then

$$Y_t = \beta_0 + Y_{t-1} + u_t, \tag{17.16}$$

where $u_t$ is serially uncorrelated. Also recall from Section 15.6 that, if a series has a
random walk trend, then it has an autoregressive root that equals 1.

Although the random walk model of a trend describes the long-run move- 
ments of many economic time series, some economic time series have trends that 
are smoother — that is, that vary less from one period to the next —than is implied 
by Equation (17.16). A different model is needed to describe the trends of such 
series. 

One model of a smooth trend makes the first difference of the trend follow a
random walk; that is,

$$\Delta Y_t = \beta_0 + \Delta Y_{t-1} + u_t, \tag{17.17}$$

where $u_t$ is serially uncorrelated. Thus, if $Y_t$ follows Equation (17.17), $\Delta Y_t$ follows a
random walk, so $\Delta Y_t - \Delta Y_{t-1}$ is stationary. The difference of the first differences,
$\Delta Y_t - \Delta Y_{t-1}$, is called the second difference of $Y_t$ and is denoted $\Delta^2 Y_t = \Delta Y_t - \Delta Y_{t-1}$.
In this terminology, if $Y_t$ follows Equation (17.17), then its second difference is stationary.
If a series has a trend of the form in Equation (17.17), then the first difference of the
series has an autoregressive root that equals 1.


Orders of integration terminology. Some additional terminology is useful for
distinguishing between these two models of trends. A series that has a random walk
trend is said to be integrated of order one, or I(1). A series that has a trend of the
form in Equation (17.17) is said to be integrated of order two, or I(2). A series that
does not have a stochastic trend and is stationary is said to be integrated of order
zero, or I(0).

The order of integration in the I(1) and I(2) terminology is the number of times
that the series needs to be differenced for it to be stationary: If $Y_t$ is I(1), then the first
difference of $Y_t$, $\Delta Y_t$, is stationary, and if $Y_t$ is I(2), then the second difference of $Y_t$,
$\Delta^2 Y_t$, is stationary. If $Y_t$ is I(0), then $Y_t$ is stationary.

Orders of integration are summarized in Key Concept 17.4.

• If $Y_t$ is integrated of order one (that is, if $Y_t$ is I(1)), then $Y_t$ has a unit
  autoregressive root, and its first difference, $\Delta Y_t$, is stationary.

• If $Y_t$ is integrated of order two (that is, if $Y_t$ is I(2)), then $\Delta Y_t$ has a unit
  autoregressive root, and its second difference, $\Delta^2 Y_t$, is stationary.

• If $Y_t$ is integrated of order d (that is, if $Y_t$ is I(d)), then $Y_t$ must be
  differenced d times to eliminate its stochastic trend; that is, $\Delta^d Y_t$ is stationary.
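
The link between differencing and the order of integration can be seen by building a series the other way around. In the Python sketch below, a short list of made-up integer shocks stands in for the serially uncorrelated errors; summing them once gives a driftless random walk, and summing them twice gives a series of the form in Equation (17.17) with no drift. Differencing twice recovers the shocks exactly. The helper name `difference` is hypothetical.

```python
from itertools import accumulate


def difference(series):
    """First difference: the change from each value to the next."""
    return [current - previous for previous, current in zip(series, series[1:])]


shocks = [1, -2, 0, 3, -1, 2, -3, 1]              # stand-in for u_t
first_diff_path = list(accumulate(shocks))        # random walk: plays the role of ΔY_t
level_path = list(accumulate(first_diff_path))    # Y_t, integrated of order two

print(difference(level_path))               # the random walk (first value lost)
print(difference(difference(level_path)))   # the original shocks (first two lost)
```

Each round of differencing costs one observation, which is why the recovered series are shorter than the originals.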


How to test whether a series is I(2) or I(1). If $Y_t$ is I(2), then $\Delta Y_t$ is I(1), so $\Delta Y_t$ has an
autoregressive root that equals 1. If, however, $Y_t$ is I(1), then $\Delta Y_t$ is stationary. Thus
the null hypothesis that $Y_t$ is I(2) can be tested against the alternative hypothesis that
$Y_t$ is I(1) by testing whether $\Delta Y_t$ has a unit autoregressive root. If the hypothesis that
$\Delta Y_t$ has a unit autoregressive root is rejected, then the hypothesis that $Y_t$ is I(2) is
rejected in favor of the alternative that $Y_t$ is I(1).


Examples of I(2) and I(1) series: The price level and the rate of inflation. The rate
of inflation is the growth rate of the price level. Recall from Section 15.2 that the
growth rate of a time series $X_t$ can be computed as the first difference of the logarithm
of $X_t$; that is, $\Delta\ln(X_t)$ is the growth rate of $X_t$ (expressed as a fraction). If $P_t$ is a
time series for the price level measured quarterly, then $\Delta\ln(P_t)$ is its growth rate, and
$Infl_t = 400 \times \Delta\ln(P_t)$ is the quarterly rate of inflation, measured in percentage
points at an annual rate. As in the expression for the growth of GDP, GDPGR in
Equation (15.1), the factor 400 arises from converting fractional changes to percentage
changes (multiplying by 100) and converting quarterly percentages to an annual
rate (multiplying by 4).
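
As a quick illustration of the formula, the Python sketch below computes annualized quarterly inflation from a price index. The three price levels are invented for the example (prices rising by 0.5% per quarter), and the function name `annualized_inflation` is hypothetical.

```python
import math


def annualized_inflation(price_levels):
    """Infl_t = 400 * (ln P_t - ln P_{t-1}), in percentage points at an annual rate."""
    return [400 * (math.log(current) - math.log(previous))
            for previous, current in zip(price_levels, price_levels[1:])]


prices = [100.0, 100.5, 101.0025]   # hypothetical index, up 0.5% each quarter
for rate in annualized_inflation(prices):
    print(round(rate, 2))
```

A 0.5% quarterly rise gives an annualized rate just under 2 percentage points, because the log difference is slightly smaller than the simple percentage change.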

In Empirical Exercise 15.1, you analyzed the inflation rate, $Infl_t$, computed
using the price index for personal consumption expenditures in the United States
as $P_t$. In that exercise, you concluded that the rate of inflation in the United States
plausibly has a random walk stochastic trend; that is, the rate of inflation is
I(1). If inflation is I(1), then its stochastic trend is removed by first differencing, so
$\Delta Infl_t$ is stationary. But treating inflation as I(1) is equivalent to treating $\Delta\ln(P_t)$ as
I(1), and this in turn is equivalent to treating the logarithm of the price level, $\ln(P_t)$,
as I(2).

The logarithm of the price level and the rate of inflation are plotted in Figure 17.1.
The long-run trend of the logarithm of the price level (Figure 17.1a) varies more
smoothly than the long-run trend in the rate of inflation (Figure 17.1b). The smooth
trend in the logarithm of the price level is typical of I(2) series.

[Figure 17.1: The trend in the logarithm of prices (Figure 17.1a) is much smoother than the trend in inflation (Figure 17.1b).]


Why Do Unit Root Tests Have Nonnormal Distributions? 


In Section 15.7, it was stressed that the large-sample normal distribution on which
regression analysis relies so heavily does not apply if the regressors are nonstationary.
Under the null hypothesis that the regression contains a unit root, the regressor $Y_{t-1}$
in the Dickey-Fuller regression is nonstationary. The nonnormal distribution of the
unit root test statistics is a consequence of this nonstationarity.

To gain some mathematical insight into this nonnormality, consider the simplest
possible Dickey-Fuller regression, in which $\Delta Y_t$ is regressed against the single
regressor $Y_{t-1}$ and the intercept is excluded. In the notation of Equation (15.32), the OLS
estimator in this regression is $\hat{\delta} = \sum_{t=1}^{T} Y_{t-1}\Delta Y_t \big/ \sum_{t=1}^{T} Y_{t-1}^2$, so

$$T\hat{\delta} = \frac{\frac{1}{T}\sum_{t=1}^{T} Y_{t-1}\Delta Y_t}{\frac{1}{T^2}\sum_{t=1}^{T} Y_{t-1}^2}. \tag{17.19}$$

Consider the numerator in Equation (17.19). Under the additional assumption that
$Y_0 = 0$, a bit of algebra (Exercise 17.5) shows that

$$\frac{1}{T}\sum_{t=1}^{T} Y_{t-1}\Delta Y_t = \frac{1}{2}\left[\left(\frac{Y_T}{\sqrt{T}}\right)^2 - \frac{1}{T}\sum_{t=1}^{T}(\Delta Y_t)^2\right]. \tag{17.20}$$

Under the null hypothesis, $\Delta Y_t = u_t$, which is serially uncorrelated and has a
finite variance, so the second term in Equation (17.20) has the probability limit
$\frac{1}{T}\sum_{t=1}^{T}(\Delta Y_t)^2 \xrightarrow{p} \sigma_u^2$. Under the assumption that $Y_0 = 0$, the first term in
Equation (17.20) can be written $Y_T/\sqrt{T} = \sqrt{1/T}\,\sum_{t=1}^{T}\Delta Y_t = \sqrt{1/T}\,\sum_{t=1}^{T}u_t$, which in turn
obeys the central limit theorem; that is, $Y_T/\sqrt{T} \xrightarrow{d} N(0, \sigma_u^2)$. Thus
$(Y_T/\sqrt{T})^2 - \frac{1}{T}\sum_{t=1}^{T}(\Delta Y_t)^2 \xrightarrow{d} \sigma_u^2(Z^2 - 1)$, where Z is a standard normal random
variable. Recall, however, that the square of a standard normal random variable has a
chi-squared distribution with 1 degree of freedom. It therefore follows from
Equation (17.20) that, under the null hypothesis, the numerator in Equation (17.19) has the
limiting distribution

$$\frac{1}{T}\sum_{t=1}^{T} Y_{t-1}\Delta Y_t \xrightarrow{d} \frac{\sigma_u^2}{2}\left(\chi_1^2 - 1\right). \tag{17.21}$$

The large-sample distribution in Equation (17.21) is different from the usual large-sample
normal distribution when the regressor is stationary. Instead, the numerator
of the OLS estimator of the coefficient on $Y_{t-1}$ in this Dickey-Fuller regression has a
distribution that is proportional to a chi-squared distribution with 1 degree of freedom
minus 1.
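
Both the algebraic identity for the numerator and the skewness of its distribution can be checked numerically. The Python sketch below is an illustration under stated assumptions: the shocks are drawn as independent standard normal variables (one convenient choice satisfying the null hypothesis), $Y_0 = 0$, and the sample size, seed, and number of replications are arbitrary. The function names are hypothetical.

```python
import random


def numerator(shocks):
    """(1/T) * sum of Y_{t-1} * ΔY_t for a random walk with Y_0 = 0."""
    T = len(shocks)
    levels = [0.0]                       # levels[t] holds Y_t
    for u in shocks:
        levels.append(levels[-1] + u)
    return sum(levels[t - 1] * shocks[t - 1] for t in range(1, T + 1)) / T


def identity_right_side(shocks):
    """(1/2) * [(Y_T / sqrt(T))^2 - (1/T) * sum of (ΔY_t)^2]."""
    T = len(shocks)
    y_T = sum(shocks)
    return 0.5 * ((y_T / T ** 0.5) ** 2 - sum(u * u for u in shocks) / T)


random.seed(0)
T, replications = 200, 2000
negative = 0
for _ in range(replications):
    u = [random.gauss(0.0, 1.0) for _ in range(T)]
    assert abs(numerator(u) - identity_right_side(u)) < 1e-9
    if numerator(u) < 0:
        negative += 1

share_negative = negative / replications
print(share_negative)
```

If the numerator were approximately normal with mean zero, about half the draws would be negative. Under the chi-squared limit, the share should instead be close to the probability that a chi-squared variable with 1 degree of freedom is below 1, roughly 0.68.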

This discussion has considered only the numerator of $T\hat{\delta}$. The denominator also
behaves unusually under the null hypothesis: Because $Y_t$ follows a random walk
under the null hypothesis, $\frac{1}{T}\sum_{t=1}^{T} Y_{t-1}^2$ does not converge in probability to a constant.
Instead, the denominator in Equation (17.19) is a random variable, even in large
samples: Under the null hypothesis, $\frac{1}{T^2}\sum_{t=1}^{T} Y_{t-1}^2$ converges in distribution jointly
with the numerator. The unusual joint distribution of the numerator and denominator
in Equation (17.19) is the source of the nonstandard distribution of the Dickey-Fuller
test statistic and the reason that the ADF statistic has its own special table of
critical values.


Cointegration


Sometimes two or more series share the same stochastic trend. In this special case, 
referred to as cointegration, regression analysis can reveal long-run relationships 
among time series variables, but some new methods are needed. 


Cointegration and Error Correction 


Two or more time series with stochastic trends can move together so closely over the 
long run that they appear to have the same trend component; that is, they appear to 
have a common trend. For example, the 90-day and 10-year U.S. Treasury interest 
rates in Figure 15.3 exhibit the same long-run tendencies or trends: Both were low in 
the 1960s, both rose through the 1970s to peaks in the early 1980s, and then both fell 
through the 1990s. However, the difference between the long-term and short-term 
interest rates, the term spread shown in Figure 15.3b, does not appear to have a trend. 
That is, subtracting the short-term rate from the long-term rate appears to elimi- 
nate the trends in both of the individual rates. Said differently, although the two 
interest rates differ, they appear to share a common stochastic trend: Because the 
trend in each individual series is eliminated by subtracting one series from the 
other, the two series must have the same trend; that is, they must have a common 
stochastic trend. 

Two or more series that have a common stochastic trend are said to be cointe- 
grated. The formal definition of cointegration (due to the econometrician Clive 
Granger; see the box “Nobel Laureates in Time Series Econometrics”) is given in 
Key Concept 17.5. In this section, we introduce a test for whether cointegration is
present, discuss estimation of the coefficients of regressions relating cointegrated 
variables, and illustrate the use of the cointegrating relationship for forecasting. The 
discussion initially focuses on the case that there are only two variables, X, and Y,. 


Vector error correction model. If $X_t$ and $Y_t$ are cointegrated, the first differences of
$X_t$ and $Y_t$ can be modeled using a VAR, augmented by including $Y_{t-1} - \theta X_{t-1}$ as an
additional regressor:

$$\begin{aligned}
\Delta Y_t &= \beta_{10} + \beta_{11}\Delta Y_{t-1} + \cdots + \beta_{1p}\Delta Y_{t-p} + \gamma_{11}\Delta X_{t-1} \\
&\quad + \cdots + \gamma_{1p}\Delta X_{t-p} + \alpha_1(Y_{t-1} - \theta X_{t-1}) + u_{1t}
\end{aligned} \tag{17.22}$$

$$\begin{aligned}
\Delta X_t &= \beta_{20} + \beta_{21}\Delta Y_{t-1} + \cdots + \beta_{2p}\Delta Y_{t-p} + \gamma_{21}\Delta X_{t-1} \\
&\quad + \cdots + \gamma_{2p}\Delta X_{t-p} + \alpha_2(Y_{t-1} - \theta X_{t-1}) + u_{2t}
\end{aligned} \tag{17.23}$$

The term $Y_t - \theta X_t$ is called the error correction term: if the two variables are far
apart, by virtue of their sharing a trend, one would expect the variables to get closer
together over time, so that the "error" $Y_t - \theta X_t$ will be "corrected."

The combined model in Equations (17.22) and (17.23) is called a vector error
correction model (VECM). In a VECM, past values of $Y_t - \theta X_t$ help to predict future
values of $\Delta Y_t$ and/or $\Delta X_t$.

Key Concept 17.5: Suppose that $X_t$ and $Y_t$ are integrated of order one. If, for some coefficient
$\theta$, $Y_t - \theta X_t$ is integrated of order zero, then $X_t$ and $Y_t$ are said to be cointegrated.
The coefficient $\theta$ is called the cointegrating coefficient.

If $X_t$ and $Y_t$ are cointegrated, then they have the same, or a common, stochastic
trend. Computing the difference $Y_t - \theta X_t$ eliminates this common stochastic trend.
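
To see how the error correction term pulls the variables back together, consider a one-step VECM forecast with a single lag of each difference (p = 1). The Python sketch below is illustrative only: the function name `vecm_one_step`, the dictionary layout, and every coefficient value are made up for the example rather than estimated from data.

```python
def vecm_one_step(y, x, theta, eq_y, eq_x):
    """One-step ahead forecasts of Y and X from a VECM with p = 1.

    y, x       -- observed series, most recent last (at least two values each)
    theta      -- cointegrating coefficient
    eq_y, eq_x -- dicts with keys 'const', 'dy', 'dx', 'alpha' for each equation
    """
    dy_lag = y[-1] - y[-2]
    dx_lag = x[-1] - x[-2]
    error_correction = y[-1] - theta * x[-1]

    def predicted_change(eq):
        return (eq["const"] + eq["dy"] * dy_lag + eq["dx"] * dx_lag
                + eq["alpha"] * error_correction)

    return y[-1] + predicted_change(eq_y), x[-1] + predicted_change(eq_x)


# Hypothetical coefficients: only Y adjusts, closing half the gap each period.
eq_y = {"const": 0.0, "dy": 0.0, "dx": 0.0, "alpha": -0.5}
eq_x = {"const": 0.0, "dy": 0.0, "dx": 0.0, "alpha": 0.0}

y_next, x_next = vecm_one_step([5.0, 6.0], [4.0, 4.0], theta=1.0,
                               eq_y=eq_y, eq_x=eq_x)
print(y_next, x_next, y_next - 1.0 * x_next)
```

In this example the gap between the two series is 2 in the last observed period and the forecast gap is 1: a negative adjustment coefficient in the first equation moves Y toward X.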


How Can You Tell Whether Two Variables 
Are Cointegrated? 


There are three ways to determine whether two variables can plausibly be mod- 
eled as cointegrated: You can use expert knowledge and economic theory, graph 
the series and see whether they appear to have a common stochastic trend, and 
perform statistical tests for cointegration. In practice, you should use all three 
methods. 

For example, the two interest rates in Figure 15.3 are linked together by the so-called
expectations theory of the term structure of interest rates, which holds that the
10-year Treasury bond rate is the average of the sequence of expected interest rates
on 3-month Treasury bills over the 10-year life of the bond. Thus, if the 3-month interest
rate has a random walk stochastic trend, this theory implies that this stochastic
trend is inherited by the 10-year interest rate (Exercise 17.2). Moreover, the plot of
the two interest rates in Figure 15.3 shows that each of the series appears to be I(1)
but that the term spread appears to be I(0), so it is plausible that the two series are
cointegrated.

The unit root testing procedures introduced so far can be extended to tests
for cointegration. The insight on which these tests are based is that if $Y_t$ and $X_t$
are cointegrated with cointegrating coefficient $\theta$, then $Y_t - \theta X_t$ is stationary;
otherwise, $Y_t - \theta X_t$ is nonstationary, that is, I(1). The hypothesis that $Y_t$ and $X_t$
are not cointegrated (that is, that $Y_t - \theta X_t$ is I(1)) therefore can be tested by testing
the null hypothesis that $Y_t - \theta X_t$ has a unit root; if this hypothesis is rejected, then $Y_t$
and $X_t$ can be modeled as cointegrated. The details of this test depend on whether
the cointegrating coefficient $\theta$ is known.


Testing for cointegration when $\theta$ is known. In many cases, expert knowledge or
economic theory suggests a value for $\theta$. When $\theta$ is known, the ADF unit root tests can
be used to test for cointegration by first constructing the series $z_t = Y_t - \theta X_t$ and
then testing the null hypothesis that $z_t$ has a unit autoregressive root.

As an illustration, applying the ADF test to the term spread (the difference
between the 10-year and 90-day Treasury rates) from 1962 to 2017, with an intercept
and (AIC-determined) six lags, yields an ADF statistic of -4.13. This value is less
than -3.43 from Table 15.4, so the null hypothesis of no cointegration (a unit root in
the term spread) is rejected at the 1% significance level.

Table 17.1: Critical Values for the Engle-Granger ADF Statistic

Number of X's in Equation (17.24)      10%      5%       1%
1                                     -3.12    -3.41    -3.96
2                                     -3.52    -3.80    -4.36
3                                     -3.84    -4.16    -4.73
4                                     -4.20    -4.49    -5.07


Testing for cointegration when $\theta$ is unknown. If the cointegrating coefficient $\theta$ is
unknown, then it must be estimated prior to testing for a unit root in the error correction
term. This preliminary step makes it necessary to use different critical values
for the subsequent unit root test.

Specifically, in the first step the cointegrating coefficient $\theta$ is estimated by OLS
estimation of the regression

$$Y_t = \alpha + \theta X_t + z_t. \tag{17.24}$$

In the second step, a Dickey-Fuller t-test (with an intercept but no time trend) is used
to test for a unit root in the residual from this regression, $\hat{z}_t$. This two-step procedure
is called the Engle-Granger Augmented Dickey-Fuller test for cointegration, or
EG-ADF test (Engle and Granger 1987).

Critical values of the EG-ADF statistic are given in Table 17.1.¹ The critical values
in the first row apply when there is a single regressor in Equation (17.24), so there are
two cointegrated variables ($X_t$ and $Y_t$). The subsequent rows apply to the case of
multiple cointegrated variables, which is discussed at the end of this section.
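
The two steps can be sketched in plain Python. This is a simplified illustration, not a full implementation: the second step omits lagged differences of the residual (so it is a Dickey-Fuller rather than an augmented Dickey-Fuller regression), it uses homoskedasticity-only standard errors, and the data are simulated so that the two series are cointegrated with a coefficient of 2 by construction. All function names, the seed, and the sample size are assumptions.

```python
import random


def simple_ols(y, x):
    """OLS of y on an intercept and x; returns (slope, residuals, slope_se)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    error_variance = sum(e * e for e in residuals) / (n - 2)
    return slope, residuals, (error_variance / sxx) ** 0.5


def eg_df_statistic(y, x):
    """Step 1: estimate theta by OLS. Step 2: Dickey-Fuller t-test on residuals."""
    theta_hat, z_hat, _ = simple_ols(y, x)
    dz = [current - previous for previous, current in zip(z_hat, z_hat[1:])]
    delta_hat, _, delta_se = simple_ols(dz, z_hat[:-1])   # Δz_t on z_{t-1}
    return theta_hat, delta_hat / delta_se


# Simulated cointegrated pair: X is a random walk, Y = 2 X + stationary noise.
random.seed(1)
x, level = [], 0.0
for _ in range(300):
    level += random.gauss(0.0, 1.0)
    x.append(level)
y = [2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

theta_hat, statistic = eg_df_statistic(y, x)
print(round(theta_hat, 2), statistic < -3.41)   # compare with the 5% value, one X
```

The statistic is compared with the first row of Table 17.1 rather than with the usual ADF critical values, because the residual is built from an estimated coefficient.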


Estimation of Cointegrating Coefficients 


If $X_t$ and $Y_t$ are cointegrated, then the OLS estimator of the coefficient in the
cointegrating regression in Equation (17.24) is consistent. However, in general, the OLS
estimator (like the ADF test statistic, for similar reasons) has a nonnormal distribution,
and inferences based on its t-statistics can be misleading whether or not those
t-statistics are computed using HAC standard errors. Because of these drawbacks of
the OLS estimator of $\theta$, econometricians have developed a number of other estimators
of the cointegrating coefficient.

One such estimator of $\theta$ that is simple to use in practice is the dynamic OLS
(DOLS) estimator (Stock and Watson 1993). The DOLS estimator is based on a
modified version of Equation (17.24) that includes past, present, and future values of
the change in $X_t$:

$$Y_t = \beta_0 + \theta X_t + \sum_{j=-p}^{p} \delta_j \Delta X_{t-j} + u_t. \tag{17.25}$$

Thus, in Equation (17.25), the regressors are $X_t, \Delta X_{t+p}, \ldots, \Delta X_{t-p}$. The DOLS
estimator of $\theta$ is the OLS estimator of $\theta$ in the regression of Equation (17.25).

¹ The critical values in Table 17.1 are taken from Fuller (1976) and Phillips and Ouliaris (1990). Following
a suggestion by Hansen (1992), the critical values in Table 17.1 are chosen so that they apply whether or
not $X_t$ and $Y_t$ have drift components.

If $X_t$ and $Y_t$ are cointegrated, then the DOLS estimator is efficient in large samples.
Moreover, statistical inferences about $\theta$ and the $\delta$'s in Equation (17.25) based on HAC
standard errors are valid. For example, the t-statistic constructed using the DOLS
estimator with HAC standard errors has a standard normal distribution in large samples.

As an illustration, for a DOLS regression of the 90-day Treasury rate on the
10-year Treasury rate, using the data in Figure 15.3 and p = 4 leads and lags, the
DOLS estimate of the cointegrating coefficient is 1.02. The HAC standard error,
computed using a truncation parameter of m = 5, is 0.05. Thus the null hypothesis
that $\theta = 1$ cannot be rejected at the 10% significance level. This result, along with
the finding that the term spread is stationary, is consistent with the theory of the term
structure of interest rates.
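
The arithmetic behind this conclusion is short. Using the estimate and HAC standard error reported above, and the two-sided 10% critical value of the standard normal distribution (1.64), the t-statistic for the hypothesis $\theta = 1$ works out as follows:

```latex
t = \frac{\hat{\theta}^{DOLS} - 1}{SE(\hat{\theta}^{DOLS})}
  = \frac{1.02 - 1}{0.05}
  = 0.40 < 1.64 .
```

Because the statistic is well below the critical value in absolute terms, the estimate is statistically indistinguishable from 1, the value implied by treating the term spread itself as the error correction term.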


Extension to Multiple Cointegrated Variables 


The concepts, tests, and estimators discussed here extend to more than two variables. 
For example, if there are three variables, $Y_t$, $X_{1t}$, and $X_{2t}$, each of which is $I(1)$, then 
they are cointegrated with cointegrating coefficients $\theta_1$ and $\theta_2$ if $Y_t - \theta_1 X_{1t} - \theta_2 X_{2t}$ is 
stationary. When there are three or more variables, there can be multiple cointegrating 
relationships. For example, consider modeling the relationship among three interest 
rates: the three-month rate ($R3m$), the one-year rate ($R1y$), and the ten-year rate 
($R10y$). If they are $I(1)$, then the expectations theory of the term structure of interest 
rates suggests that they will all be cointegrated. One cointegrating relationship suggested 
by the theory is $R10y_t - R3m_t$, and a second relationship is $R1y_t - R3m_t$. 
(The relationship $R10y_t - R1y_t$ is also a cointegrating relationship, but it contains no 
additional information beyond that in the other relationships because it is perfectly 
multicollinear with the other two cointegrating relationships.) 
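
The multicollinearity noted in the parenthetical remark can be written out in one line. Using the same three rates, the third spread is exactly the difference of the first two, so it is stationary whenever they are and adds no new restriction:

$$R10y_t - R1y_t = (R10y_t - R3m_t) - (R1y_t - R3m_t).$$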

The EG-ADF procedure for testing for a single cointegrating relationship among 
multiple variables is the same as for the case of two variables except that the regression 
in Equation (17.24) is modified so that both $X_{1t}$ and $X_{2t}$ are regressors; the critical values 
for the EG-ADF test are given in Table 17.1, where the appropriate row depends on the 
number of regressors in the first-stage OLS cointegrating regression. The DOLS estimator 
of a single cointegrating relationship among multiple $X$'s involves including the 
level of each $X$ along with leads and lags of the first difference of each $X$. For additional 
discussion of cointegration methods for multiple variables, see Hamilton (1994). 

Even if economic theory does not suggest a specific value of the cointegrating 
coefficient, it is important to check whether the estimated cointegrating relationship 
makes sense in practice. Because cointegration tests can be misleading (they can 
improperly reject the null hypothesis of no cointegration more frequently than they 
should, and frequently they improperly fail to reject the null hypothesis), it is espe- 
cially important to rely on economic theory, institutional knowledge, and common 
sense when estimating and using cointegrating relationships. 


Volatility Clustering and Autoregressive 
Conditional Heteroskedasticity 


The phenomenon that some times are tranquil, while others are not—that is, that 
volatility comes in clusters—shows up in many economic time series. This section 
presents a pair of models for quantifying volatility clustering or, as it is also known, 
conditional heteroskedasticity. 


Volatility Clustering 


The volatility of many financial and macroeconomic variables changes over time. For 
example, daily percentage changes in the Wilshire 5000 Total Market Index, shown 
in Figure 17.2, exhibit periods of high volatility, such as in 2001 and 2008, and other 
periods of low volatility, such as in 2004 and 2017. A series with some periods of low 
volatility and some periods of high volatility is said to exhibit volatility clustering. 
Because the volatility appears in clusters, the variance of the daily percentage price 
change in the Wilshire 5000 can be forecasted, even though the daily price change 
itself is very difficult to forecast. 

FIGURE 17.2 Daily Percentage Changes in the Wilshire 5000 Total Market Index, 1990-2017 
Daily percentage price changes in the Wilshire 5000 Total Market Index exhibit volatility clustering, in which there are 
some periods of high volatility, such as in 2008, and other periods of relative tranquility, such as in 2004. 

Forecasting the variance of a series is of interest for several reasons. First, the variance 
of price changes for some asset is a measure of the risk of owning that asset: The larger the 
variance of daily stock price changes, the more a stock market participant stands to gain, 
or lose, on a typical day. An investor who is worried about risk would be less tolerant of 
participating in the stock market during a period of high, rather than low, volatility. 

Second, the value of some financial derivatives, such as options, depends on the 
variance of the underlying asset. An options trader wants the best available forecasts 
of future volatility to help him or her know the price at which to buy or sell options. 

Third, forecasting variances can improve the accuracy of forecast intervals. Sup- 
pose that you are forecasting the rate of inflation. If the variance of the forecast error 
is constant, then an approximate forecast confidence interval can be constructed 
using the standard error of the regression or final prediction error as discussed in 
Section 15.5. If, however, the variance of the forecast error changes over time, then 
the width of the forecast interval should change over time: At periods when inflation 
is subject to particularly large disturbances or shocks, the interval should be wide; 
during periods of relative tranquility, the interval should be tighter. If the variance of the 
forecast error changes slowly, then the pseudo out-of-sample forecast error estimate of the 
MSFE in Equation (15.22) can be used, but to capture more rapid changes in volatility, 
such as those observed in Figure 17.2, other methods must be used. 

Volatility clustering can be thought of as clustering of the variance of the error 
term over time: If the regression error has a small variance in one period, its variance 
tends to be small in the next period, too. In other words, volatility clustering implies 
that the error exhibits time-varying heteroskedasticity. 

When data are observed at a high frequency, it is possible to measure volatility 
directly using a measure called realized volatility. When data are observed less fre- 
quently, it is possible to estimate a model of the volatility and use that to estimate 
current volatility. We address these two approaches in turn. 


Realized Volatility 


Suppose you have daily data on asset returns, like those shown in Figure 17.2. One way 
to estimate the volatility in a given month is to compute the sample variance of asset 
returns in that month. For asset returns measured at high frequency, the mean return is 
typically very small compared with the variation in the return, as is evident in Figure 17.2. 
For that reason, for asset returns, and more generally for series that can be measured at 
a high frequency, the volatility of the return is measured not by the sample variance but 
simply by its mean square. Accordingly, the $h$-period realized volatility of a variable $X_t$ is 
the sample root mean square of $X$ computed over $h$ consecutive periods: 

$$RV_t^h = \sqrt{\frac{1}{h} \sum_{s=t-h+1}^{t} X_s^2}. \tag{17.26}$$


FIGURE 17.3 Daily Percentage Changes in the Wilshire 5000 Total Market Index, 20-Day Realized Volatility Bands, and GARCH(1, 1) Bands, 2015-2017 
The volatility of stock price changes varies considerably over the 2015-2017 period. The volatility bands are narrow 
when volatility is low and wide when it is high. The 20-day realized volatility bands (black) and GARCH(1, 1) bands 
(dark blue) are similar to each other. Vertical axis: percent per day. 


The 20-day realized volatility bands of the data in Figure 17.2 for 2015-2017 are 
plotted in Figure 17.3. As can be seen from the figure, the realized volatility bands 
provide a smooth measure of the volatility clustering evident in that figure. 

In practice, realized volatility is typically computed using higher-frequency data than 
just daily. For example, the stock of a major company might be traded sufficiently frequently 
that its price can be measured at five-minute intervals. If so, these five-minute intervals can 
be used to compute realized volatility for a day, or even for a period of hours within a day. 
High-frequency realized volatility is one of the tools used in high-frequency trading. 
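
The realized volatility formula in Equation (17.26) is a rolling root mean square, which the following sketch computes with cumulative sums. The function name `realized_volatility` and the toy input series are illustrative assumptions; dates that do not yet have h observations are returned as missing values.

```python
import numpy as np

def realized_volatility(x, h):
    """h-period realized volatility: root mean square of x over the last h periods."""
    x = np.asarray(x, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(x ** 2)))   # csum[k] = sum of first k squares
    rv = np.full(len(x), np.nan)                         # first h-1 dates are undefined
    rv[h - 1:] = np.sqrt((csum[h:] - csum[:-h]) / h)     # window sum of squares divided by h
    return rv

# Toy series of percentage changes (hypothetical values).
x = np.array([3.0, -4.0, 3.0, -4.0])
rv = realized_volatility(x, h=2)
print(rv)
```

Note that the squares are averaged without subtracting the sample mean, which follows the definition in the text rather than the usual sample variance.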


Autoregressive Conditional Heteroskedasticity 


When data are observed less frequently, an alternative is to estimate a model of the 
evolution of the variance over time. Two models of volatility clustering are the 
autoregressive conditional heteroskedasticity (ARCH) model and its extension, 
the generalized ARCH (GARCH) model. 


ARCH. Consider the ADL(1, 1) regression 

$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \gamma_1 X_{t-1} + u_t. \tag{17.27}$$

In the ARCH model, which was developed by the econometrician Robert Engle (1982; 
see the box “Nobel Laureates in Time Series Econometrics”), the error $u_t$ is modeled 
as being normally distributed with mean 0 and variance $\sigma_t^2$, where $\sigma_t^2$ depends on past 
squared values of $u_t$. Specifically, the ARCH model of order $p$, denoted ARCH($p$), is 

$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \cdots + \alpha_p u_{t-p}^2, \tag{17.28}$$

where $\alpha_0, \alpha_1, \ldots, \alpha_p$ are unknown coefficients. If these coefficients are positive, 
then if recent squared errors are large, the ARCH model predicts that the current 
squared error will be large in magnitude in the sense that its variance, $\sigma_t^2$, is large. 
Although it is described here for the ADL(1, 1) model in Equation (17.27), the 
ARCH model can be applied to the error variance of any time series regression 
model with an error that has a conditional mean of 0, including higher-order ADL 
models, autoregressions, and time series regressions with multiple predictors. 


GARCH. The GARCH model, developed by the econometrician Tim Bollerslev 
(1986), extends the ARCH model to let $\sigma_t^2$ depend on its own lags as well as lags of 
the squared error. The GARCH($p$, $q$) model is 

$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \cdots + \alpha_p u_{t-p}^2 + \phi_1 \sigma_{t-1}^2 + \cdots + \phi_q \sigma_{t-q}^2, \tag{17.29}$$

where $\alpha_0, \alpha_1, \ldots, \alpha_p, \phi_1, \ldots, \phi_q$ are unknown coefficients. 

The ARCH model is analogous to a distributed lag model, and the GARCH 
model is analogous to an ADL model. As discussed in Chapter 16, the ADL model 
can provide a more parsimonious model of dynamic multipliers than can the distributed 
lag model. Similarly, by incorporating lags of $\sigma_t^2$, the GARCH model can capture 
slowly changing variances with fewer parameters than the ARCH model. 

An important application of ARCH and GARCH models is to measuring and 
forecasting the time-varying volatility of returns on financial assets, particularly assets 
observed at high sampling frequencies such as the daily stock returns in Figure 17.2. In 
such applications, the return itself is often modeled as unpredictable, so the regression 
in Equation (17.27) includes only the intercept. 
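
To see how the GARCH(1, 1) recursion in Equation (17.29) generates volatility clustering, the sketch below simulates an error series whose variance follows that recursion. The function name `simulate_garch11`, the parameter values, and the choice to start the recursion at α0/(1 − α1 − φ1) (the long-run variance, which requires α1 + φ1 < 1) are assumptions made for illustration and are not taken from the text.

```python
import numpy as np

def simulate_garch11(T, alpha0, alpha1, phi1, seed=0):
    """Simulate u_t ~ N(0, sigma2_t) with sigma2_t following a GARCH(1, 1) recursion."""
    rng = np.random.default_rng(seed)
    u = np.zeros(T)
    sigma2 = np.zeros(T)
    sigma2[0] = alpha0 / (1.0 - alpha1 - phi1)      # assumed starting value
    u[0] = np.sqrt(sigma2[0]) * rng.normal()
    for t in range(1, T):
        # Variance depends on last period's squared error and last period's variance.
        sigma2[t] = alpha0 + alpha1 * u[t - 1] ** 2 + phi1 * sigma2[t - 1]
        u[t] = np.sqrt(sigma2[t]) * rng.normal()
    return u, sigma2

u, sigma2 = simulate_garch11(T=2000, alpha0=0.05, alpha1=0.10, phi1=0.85)

# First-order autocorrelation of squared errors: a simple summary of clustering.
u2 = u ** 2
acf1_sq = np.corrcoef(u2[1:], u2[:-1])[0, 1]
print(acf1_sq)
```

Setting `phi1=0` in this sketch gives an ARCH(1) error, so the same function can be used to compare the two models.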


Estimation and inference. ARCH and GARCH models are estimated by the 
method of maximum likelihood (Appendix 11.2). The estimators of the ARCH and 
GARCH coefficients are normally distributed in large samples, so in large samples, 
t-statistics have standard normal distributions, and confidence intervals can be constructed 
as the maximum likelihood estimate ± 1.96 standard errors. 


Application to Stock Price Volatility 


A GARCH(1, 1) model of the Wilshire 5000 daily percentage stock price changes, $R_t$, 
estimated using data on all trading days from January 2, 1990, through December 29, 
2017, is 

$$\hat{R}_t = \underset{(0.010)}{0.063}, \tag{17.30}$$

$$\hat{\sigma}_t^2 = \underset{(0.002)}{0.013} + \underset{(0.008)}{0.088}\, u_{t-1}^2 + \underset{(0.009)}{0.908}\, \sigma_{t-1}^2. \tag{17.31}$$


No lagged predictors appear in Equation (17.30) because daily Wilshire 5000 percentage 
price changes are essentially unpredictable. 

The two coefficients in the GARCH model (the coefficients on $u_{t-1}^2$ and $\sigma_{t-1}^2$) are 
both individually statistically significant at the 5% significance level. One measure of the 
persistence of movements in the variance is the sum of the coefficients on $u_{t-1}^2$ and $\sigma_{t-1}^2$ 
in the GARCH model (Exercise 17.9). This sum (0.99) is large, indicating that changes in 
the conditional variance are persistent. Said differently, the estimated GARCH model 
implies that periods of high volatility in stock prices will be long lasting. This implication 
is consistent with the long periods of volatility clustering seen in Figure 17.2. 
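
The significance and persistence statements above can be checked with simple arithmetic on the reported estimates. The sketch below uses the rounded coefficients and standard errors of the estimated GARCH(1, 1) model (0.088 with standard error 0.008, and 0.908 with standard error 0.009); the variable names are illustrative. Because the inputs are rounded, the computed sum can differ in the third decimal from a figure based on unrounded estimates.

```python
# Reported GARCH(1, 1) estimates and standard errors (rounded, as printed in the text).
alpha1_hat, se_alpha1 = 0.088, 0.008   # coefficient on the lagged squared error
phi1_hat, se_phi1 = 0.908, 0.009       # coefficient on the lagged variance

# Large-sample t-statistics for the hypothesis that each coefficient is 0.
t_alpha1 = alpha1_hat / se_alpha1
t_phi1 = phi1_hat / se_phi1

# 95% confidence intervals: estimate plus or minus 1.96 standard errors.
ci_alpha1 = (alpha1_hat - 1.96 * se_alpha1, alpha1_hat + 1.96 * se_alpha1)
ci_phi1 = (phi1_hat - 1.96 * se_phi1, phi1_hat + 1.96 * se_phi1)

# Persistence measure: sum of the two coefficients.
persistence = alpha1_hat + phi1_hat

print(t_alpha1, t_phi1, ci_alpha1, ci_phi1, persistence)
```

Both t-statistics far exceed 1.96, and neither confidence interval contains 0, which matches the statement that both coefficients are individually significant at the 5% level.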

The estimated conditional variance at date $t$, $\hat{\sigma}_t^2$, can be computed using the 
residuals from Equation (17.30) and the coefficients in Equation (17.31). For the 
Wilshire 5000 returns, the GARCH(1, 1) model and the 20-day realized volatility 
provide quantitatively similar estimates of the time-varying standard deviation of 
returns. This can be seen in Figure 17.3, which focuses on the 2015-2017 sample 
period. During the first half of 2015, the conditional standard deviation bands are 
relatively tight, indicating lower levels of risk for investors holding a portfolio of 
stocks making up the Wilshire 5000. But in the second half of 2015 these conditional 
standard deviations widened, indicating greater daily stock price volatility. 
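
The computation described above, residuals from the mean equation fed through the estimated variance recursion, can be sketched as follows. The coefficients are the rounded estimates reported for the Wilshire 5000 model; the three returns are made-up numbers, the function name is hypothetical, and starting the recursion at the implied long-run variance α0/(1 − α1 − φ1) is an assumption, since the text does not say how the recursion is initialized.

```python
import numpy as np

def garch11_conditional_variance(returns, mean, alpha0, alpha1, phi1):
    """Filter returns through a GARCH(1, 1) recursion with given coefficients."""
    u = np.asarray(returns, dtype=float) - mean       # residuals from the mean equation
    sigma2 = np.empty(len(u))
    sigma2[0] = alpha0 / (1.0 - alpha1 - phi1)        # assumed starting value
    for t in range(1, len(u)):
        sigma2[t] = alpha0 + alpha1 * u[t - 1] ** 2 + phi1 * sigma2[t - 1]
    return sigma2

# Hypothetical daily percentage changes, chosen so the residuals are 1, -2, and 0.5.
returns = np.array([1.063, -1.937, 0.563])
sigma2_hat = garch11_conditional_variance(returns, mean=0.063,
                                          alpha0=0.013, alpha1=0.088, phi1=0.908)
band = np.sqrt(sigma2_hat)   # conditional standard deviation; bands are plus or minus this
print(sigma2_hat, band)
```

With a long sample, the influence of the starting value fades as the recursion proceeds, so the choice of initialization matters mainly for the first observations.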

For these data, the realized volatility and GARCH bands are quantitatively simi- 
lar to each other. An advantage of realized volatility is that it measures the changing 
variance without making any modeling assumptions. An advantage of the GARCH 
model is that it can be used to forecast volatility; another advantage is that it can be 
used in applications in which the data are observed infrequently—for example, 
monthly or quarterly. In general, realized volatility and GARCH models provide two 
complementary ways to quantify volatility clustering. 


Forecasting with Many Predictors Using 
Dynamic Factor Models and Principal 
Components² 


Statistical agencies in developed economies regularly report data on hundreds or 
thousands of time series describing the macroeconomy. These data include detailed 
information from the national income and product accounts (consumption, investment, 
imports, exports, and government spending), multiple series on price and wage 
inflation, output and production by industry or sector, data on specific markets such 
as housing, and data for asset markets including interest rates and asset prices. Each 
of these series could potentially contain information that could improve macroeconomic 
forecasts. But as explained in Chapter 14, with many predictors (potentially 
more than the number of available time series observations), regressions estimated 
by OLS will provide poor out-of-sample performance. To take advantage of this 
wealth of data, other methods must be used. 

²This section draws on the material in Section 14.5, which should be read first. 

This section focuses on one such approach, which uses the principal components 
of the data set to reduce the number of coefficients to be estimated. The use of prin- 
cipal components for prediction was discussed in Section 14.5; that treatment is 
extended here to time series data. The framework for doing so is the dynamic factor 
model (DFM), which models the comovements of a large number of time series as 
arising from a small number of unobserved variables, the so-called dynamic factors. 
One of the steps in estimating a DFM is estimation of these unobserved factors using 
principal components. As discussed at the end of this section, the DFM can be used 
for purposes other than forecasting. 

The DFM is a widely used approach for forecasting with many time series pre- 
dictors, but it is not the only approach. Another method is to estimate a VAR with 
many predictors but to use shrinkage methods, including Bayesian methods, to esti- 
mate those coefficients. For a graduate textbook discussion of Bayesian estimation 
of VARs, see Kilian and Lütkepohl (2017). 


The Dynamic Factor Model 


A central empirical regularity of developed economies is that there are broad com- 
mon movements among macroeconomic variables: When there is strength in one part 
of the economy, there often is strength in other parts as well. At a horizon of several 
years, the common swings in many economic variables give rise to what are referred 
to as business cycles. Macroeconomic variables also move together at shorter hori- 
zons (months or quarters) and at longer horizons (decadal movements in long-term 
growth rates). Theories of macroeconomic fluctuations build on this empirical regu- 
larity of broadly observed comovements and attribute these comovements to a rela- 
tively small number of driving forces, such as productivity improvements, monetary 
policy, fiscal policy, and changes in demand or consumer preferences. 

The dynamic factor model captures this notion that there are a small number (r) 
of common factors, which drive the comovements among a large number (N) of time 
series variables. The DFM treats these driving factors as unobserved. Treating the 
factors as unobserved admits that macroeconomists do not know all the sources of 
macroeconomic fluctuations and that even if they did, those sources would be diffi- 
cult to measure directly (for example, technological progress is very difficult to mea- 
sure). In a DFM, observed macroeconomic variables, such as GDP growth and the 
unemployment rate, are modeled as depending on these common unobserved factors 
and on other omitted drivers or measurement error. 
Stated mathematically, the DFM has two parts. The first relates each of the $N$ 
observable variables, $X_{it}$, to the $r$ factors $F_{1t}, \ldots, F_{rt}$ plus an error term $u_{it}$: 

$$X_{it} = \Lambda_{i0} + \Lambda_{i1} F_{1t} + \cdots + \Lambda_{ir} F_{rt} + u_{it}, \quad i = 1, \ldots, N, \tag{17.32}$$

where $\Lambda_{i0}$ is an intercept, $\Lambda_{i1}, \ldots, \Lambda_{ir}$ are unknown coefficients relating the $r$ factors to the $i$th observable 
variable, and $u_{it}$ is a mean 0 error term that represents omitted effects that are 
unique to $X_{it}$ (that is, not common across variables) and measurement error. 

The second part of the DFM specifies that the r factors follow a VAR. For nota- 
tional convenience, we write the VAR here with a single lag [that is, as a VAR(1)]; 
however, more lags can be included: 


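$$\begin{aligned} F_{1t} &= A_{11} F_{1,t-1} + A_{12} F_{2,t-1} + \cdots + A_{1r} F_{r,t-1} + \eta_{1t} \\ &\;\;\vdots \\ F_{rt} &= A_{r1} F_{1,t-1} + A_{r2} F_{2,t-1} + \cdots + A_{rr} F_{r,t-1} + \eta_{rt}, \end{aligned} \tag{17.33}$$


where the $A$'s are unknown VAR coefficients and the $\eta$'s are mean 0 error terms. The 
factor VAR in Equation (17.33) is the extension to multiple variables (the $r$ factors) 
of the two-variable VAR in Key Concept 17.1. 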

The error term $u_{it}$ is assumed to be uncorrelated across series and to be uncorrelated 
with the factor VAR errors (that is, $E(u_{it} u_{j,t+k}) = 0$, $i \neq j$, and $E(u_{it} \eta_{j,t+k}) = 0$ 
for all $k$), so that all the common movements are associated with the common factors. 
Because there is no intercept in Equation (17.33), the factors have mean 0. 

The common component of $X_{it}$ is the part of $X_{it}$ that is explained by the factors, 
that is, the predicted value of $X_{it}$ given the factors, based on the population coefficients. 
In Equation (17.32), it is $\Lambda_{i1} F_{1t} + \cdots + \Lambda_{ir} F_{rt}$. The error term in Equation 
(17.32), $u_{it}$, is called the idiosyncratic component of $X_{it}$ because it is the part of $X_{it}$ not 
explained by the common factors. In general, the idiosyncratic component can be 
serially correlated, which affects how forecasts are made using the DFM.³ 


The DFM: Estimation and Forecasting 


From the perspective of forecasting with many predictors, the DFM resolves the prob- 
lem of having many predictors by replacing the many available time series with a small 
number of factors. If the factors were observed, the $\Lambda$ coefficients in Equation (17.32) 
and the VAR coefficients in Equation (17.33) therefore could be estimated by OLS. 
The difficulty, however, is that the factors are not observed. The factors can, however, 
be estimated by the principal components of the N observed X’s. These estimated 
factors can then be treated as data for the purpose of estimating the unknown DFM 
coefficients. 


³Equations (17.32) and (17.33) are the so-called static form of the DFM, which is the version of the DFM 
most directly amenable to principal components estimation. Other forms of the DFM, and other ways to 
estimate the factors, are discussed in Stock and Watson (2016). 
Estimation of the DFM and the factors using principal components. The method of 
principal components described in Section 14.5 extends directly to the time series 
setting. As discussed in Section 14.5, the $X$ variables must first be standardized using 
their in-sample means and standard deviations; then the principal components are 
computed using the standardized $X$'s. In Section 14.5, the first $r$ principal components 
were denoted $PC_1, \ldots, PC_r$. In the context of the DFM, these principal components 
are the estimates of the common factors, and their values at date $t$ are denoted $\hat{F}_{1t}, \ldots, \hat{F}_{rt}$, 
where the caret (^) indicates that the factor is estimated. If the factor model assumptions 
are, in fact, correct, then the principal components are consistent estimates of 
the factors in the sense that predictions made using the factors (were they observed) 
and using the principal components will be the same when both $N$ and $T$ are large. 

Given the estimated factors $\hat{F}_{1t}, \ldots, \hat{F}_{rt}$, the $\Lambda$ and $A$ coefficients of the DFM in 
Equations (17.32) and (17.33) can be estimated by OLS, where the estimated factors 
are treated as data. 
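
The two estimation steps just described, principal components for the factors and then OLS with the estimated factors treated as data, can be sketched on simulated data. Everything about the simulation is a hypothetical choice made for illustration (two factors following a diagonal VAR(1), 60 series, 200 periods, the loading distribution), as are the function names. The check at the end regresses a true factor on the estimated factors, since the factors are estimated only up to linear combinations.

```python
import numpy as np

def simulate_dfm(T=200, N=60, seed=0):
    """Hypothetical two-factor DFM; the factors follow a diagonal VAR(1)."""
    rng = np.random.default_rng(seed)
    F = np.zeros((T, 2))
    for t in range(1, T):
        F[t] = np.array([0.7, 0.4]) * F[t - 1] + rng.normal(size=2)
    Lam = rng.uniform(0.8, 1.2, size=(N, 2)) * rng.choice([-1.0, 1.0], size=(N, 2))
    X = F @ Lam.T + rng.normal(size=(T, N))          # common component plus idiosyncratic error
    return X, F

def estimate_factors(X, r):
    """Standardize X, then return the first r principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    weights = Vt[:r].T                               # N x r principal component weights
    var_share = s ** 2 / np.sum(s ** 2)              # fraction of total variance per component
    return Z, Z @ weights, weights, var_share

def r_squared(y, W):
    """R-squared from an OLS regression of y on a constant and the columns of W."""
    W1 = np.column_stack([np.ones(len(W)), W])
    b, *_ = np.linalg.lstsq(W1, y, rcond=None)
    return 1.0 - (y - W1 @ b).var() / y.var()

X, F = simulate_dfm()
Z, F_hat, weights, var_share = estimate_factors(X, r=2)

# OLS of every standardized series on a constant and the estimated factors.
regs = np.column_stack([np.ones(len(F_hat)), F_hat])
Lam_hat, *_ = np.linalg.lstsq(regs, Z, rcond=None)   # rows: intercept, then r loadings

r2_first = r_squared(F[:, 0], F_hat)                 # how well F_hat spans the first true factor
print(var_share[:4], r2_first)
```

The `var_share` vector is what a scree plot displays: the fraction of the total variance of the standardized series explained by each successive principal component.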

It is tempting to interpret the principal components themselves; for example, one 
might want to interpret the first principal component (the first estimated factor) as 
measuring overall economic activity, the second as measuring price inflation, and so 
forth. Unfortunately, such interpretations generally are not justified. The reason is 
that the factors are identified only up to linear combinations; without further assump- 
tions, the factors themselves are not identified. Said differently, the common compo- 
nents of the series are identified in the dynamic factor model, but the factors 
themselves are not. For forecasting, this identification issue is irrelevant because the 
same forecasts will arise whether the factors or a linear combination of them is used 
(recall that, with OLS, the same prediction is made using, say, an intercept and the 
binary variable male as with an intercept and the binary variable female). 


Determining the number of factors. In Chapter 14, the number of principal compo- 
nents was determined by leave-m-out cross validation. This method entails randomly 
assigning data to the m subsamples and then estimating the coefficients on the m 
subsamples that omit those observations. Unfortunately, leave-m-out cross validation 
has two problems in time series data. First, the time series observations are not inde- 
pendent, so the omitted data in the left-out subsample are not independent of the 
estimation sample. Second, if a subsample, even a contiguous subsample, is omitted, 
additional observations are lost because of the lag structure in the model. 

For these reasons, determining the number of factors for DFMs tends to rely on 
scree plots and information criteria. 

The scree plot with time series data is the same as that with cross-sectional data 
and is explained in Section 14.5. 

Information criteria for determining the number of factors in a DFM have a similar 
structure to those used to determine the lag length for an autoregression [Equation (15.23)] 
or for a VAR [Equation (17.4)]. Specifically, the information criterion penalizes the sum 
of squared residuals for adding another factor. The information criterion approach to 
determining r was introduced by Bai and Ng (2002). A specific criterion they propose, 
which has been found to work well in simulations, is 


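$$IC(r) = \ln\left[\frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left(X_{it} - \hat{\Lambda}_{i0} - \hat{\Lambda}_{i1}\hat{F}_{1t} - \cdots - \hat{\Lambda}_{ir}\hat{F}_{rt}\right)^2\right] + r\left(\frac{N+T}{NT}\right)\ln[\min(N, T)], \tag{17.34}$$


where the $\hat{\Lambda}$'s are the OLS estimates of the $\Lambda$'s, estimated using the first $r$ principal components 
as regressors, and the final term is the penalty for using $r$ principal components. 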

The Bai-Ng penalty in Equation (17.34) increases proportionately to the number 
of factors $r$, with a constant of proportionality that depends on the number of variables 
as well as the number of time series observations. When $N = T$, this penalty 
simplifies to 2 times the BIC penalty, $[\ln(T)]/T$. 

Estimation of the number of factors using the information criterion in Equation 
(17.34) proceeds as for autoregressions and VARs: Among a set of candidate values 
of $r$, the estimated number of factors is the value of $r$ that minimizes $IC(r)$. 
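
The information criterion in Equation (17.34) is straightforward to evaluate once the principal components are available, because the sum of squared residuals from regressing the standardized series on the first r components equals the sum of the remaining squared singular values. The sketch below relies on that fact; the function names, the candidate set r = 0, ..., r_max, and the simulated two-factor data are assumptions made for illustration.

```python
import numpy as np

def bai_ng_penalty(N, T):
    """Per-factor penalty in the information criterion."""
    return (N + T) / (N * T) * np.log(min(N, T))

def bai_ng_ic(Z, r_max):
    """IC(r) for r = 0, ..., r_max, where Z is the T x N matrix of standardized series."""
    T, N = Z.shape
    s = np.linalg.svd(Z - Z.mean(axis=0), compute_uv=False)
    eig = s ** 2
    ic = []
    for r in range(r_max + 1):
        ssr = eig[r:].sum()                      # SSR after removing the first r components
        ic.append(np.log(ssr / (N * T)) + r * bai_ng_penalty(N, T))
    return np.array(ic)

# Hypothetical data generated from a two-factor model.
rng = np.random.default_rng(0)
T, N = 200, 60
F = np.zeros((T, 2))
for t in range(1, T):
    F[t] = np.array([0.7, 0.4]) * F[t - 1] + rng.normal(size=2)
Lam = rng.uniform(0.8, 1.2, size=(N, 2)) * rng.choice([-1.0, 1.0], size=(N, 2))
X = F @ Lam.T + rng.normal(size=(T, N))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

ic = bai_ng_ic(Z, r_max=8)
r_hat = int(np.argmin(ic))                       # estimated number of factors
print(ic, r_hat)
```

Calling `bai_ng_penalty(100, 100)` returns 2 ln(100)/100, which illustrates the statement that the penalty is twice the BIC penalty when N = T.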


Forecasting using the estimated factors. There are two approaches to forecasting 
using the estimated factors, which parallel the iterated and direct approaches to 
multi-period forecasting described in Section 17.2. 

The starting point for both approaches is to extend Equation (17.32) to an autoregressive 
distributed lag model. Because $u_{it}$ is, in general, serially correlated, past values 
of $u_{it}$ are useful for forecasting $u_{it}$ and thus $X_{it}$. Accordingly, the argument leading 
to Equation (16.21) applies here, so that the serial correlation in $u_{it}$ implies that 
lagged values of $X_{it}$ might be useful predictors as well. With these lagged terms added, 
Equation (17.32) becomes 


$$X_{it} = \Lambda_{i0} + \Lambda_{i1} F_{1t} + \cdots + \Lambda_{ir} F_{rt} + \beta_{i1} X_{i,t-1} + \cdots + \beta_{ip} X_{i,t-p} + u_{it}. \tag{17.35}$$


The right-hand side of Equation (17.35) depends on $F_{1t}, \ldots, F_{rt}$, which are 
unknown at date $t - 1$; thus current values of the factors (or their principal components 
estimates) cannot be used as predictors. The iterated and direct forecasting 
approaches take two different tacks to address this problem. 

In the iterated approach, the contemporaneous values of the factors in Equation 
(17.35) are replaced by their forecasts from the estimated factor VAR. Thus the 
one-step ahead forecast for period $T + 1$, using data through period $T$, is 


$$\hat{X}_{i,T+1|T} = \hat{\Lambda}_{i0} + \hat{\Lambda}_{i1} \hat{F}_{1,T+1|T} + \cdots + \hat{\Lambda}_{ir} \hat{F}_{r,T+1|T} + \hat{\beta}_{i1} X_{iT} + \cdots + \hat{\beta}_{ip} X_{i,T-p+1}, \tag{17.36}$$


where the $\hat{\Lambda}$'s and $\hat{\beta}$'s are the estimates of the $\Lambda$'s and $\beta$'s in Equation (17.35) using 
$\hat{F}_{1t}, \ldots, \hat{F}_{rt}$ and lagged $X$'s as regressors and where $\hat{F}_{1,T+1|T}, \ldots, \hat{F}_{r,T+1|T}$ are the one-step 
ahead forecasts of the factors computed using the factor VAR. Forecasts for horizons 
$h > 1$ are computed using the iterated VAR forecasts of the factors and of $X_{it}$. 
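
The one-step ahead iterated forecast can be sketched in two OLS steps: a VAR(1) for the estimated factors, followed by the regression of the series on the contemporaneous estimated factors and its own lags. The function name, the simulated data, and the restriction to p ≥ 1 lags are assumptions for illustration. Note that with observations stored in rows, the least squares call returns the transpose of the VAR coefficient matrix as it is written in the text.

```python
import numpy as np

def iterated_one_step_forecast(x, F_hat, p):
    """One-step ahead forecast of x using a factor VAR(1) and p >= 1 lags of x."""
    T = len(x)
    # Factor VAR(1) without intercept: row form F_t = F_{t-1} @ A_hat.
    A_hat, *_ = np.linalg.lstsq(F_hat[:-1], F_hat[1:], rcond=None)
    F_next = F_hat[-1] @ A_hat                        # forecast of the factors for T+1
    # Regression of x_t on 1, F_hat_t, and x_{t-1}, ..., x_{t-p}.
    rows = np.arange(p, T)
    lags = np.column_stack([x[rows - k] for k in range(1, p + 1)])
    W = np.column_stack([np.ones(len(rows)), F_hat[rows], lags])
    coef, *_ = np.linalg.lstsq(W, x[rows], rcond=None)
    # Plug in the factor forecasts and the p most recent values of x.
    w_next = np.concatenate(([1.0], F_next, x[T - 1 - np.arange(p)]))
    return float(w_next @ coef), F_next

# Hypothetical two-factor data set; factors estimated by principal components.
rng = np.random.default_rng(0)
T, N, r = 200, 60, 2
F = np.zeros((T, r))
for t in range(1, T):
    F[t] = np.array([0.7, 0.4]) * F[t - 1] + rng.normal(size=r)
Lam = rng.uniform(0.8, 1.2, size=(N, r)) * rng.choice([-1.0, 1.0], size=(N, r))
X = F @ Lam.T + rng.normal(size=(T, N))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
F_hat = Z @ Vt[:r].T

forecast, F_next = iterated_one_step_forecast(Z[:, 0], F_hat, p=2)
print(forecast, F_next)
```

Forecasts at longer horizons would repeat these steps, feeding each forecast of the factors and of the series back in as if it were data.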
The direct approach builds on Key Concept 17.3. Specifically, the $h$-step ahead 
direct forecasting regression using the estimated factors is 


$$X_{it} = \delta_0 + \delta_1 \hat{F}_{1,t-h} + \cdots + \delta_r \hat{F}_{r,t-h} + \delta_{r+1} X_{i,t-h} + \cdots + \delta_{r+p} X_{i,t-h-p+1} + u_{it}, \tag{17.37}$$


where there are different regressions, and thus different $\delta$ coefficients, at each forecasting 
horizon $h$. For a given horizon, the coefficients of Equation (17.37) can be estimated 
by OLS, and the direct forecasts are then made using those estimated coefficients. 
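
A direct forecast requires only one regression per horizon: the series is regressed on values of the estimated factors and of the series itself dated h periods earlier, and the forecast plugs in the most recent observations. The sketch below is one way to set this up, with a hypothetical function name and simulated data; it assumes p ≥ 1 lags of the series.

```python
import numpy as np

def direct_forecast(x, F_hat, h, p):
    """h-step ahead direct forecast of x from estimated factors and p >= 1 lags of x."""
    T = len(x)
    rows = np.arange(h + p - 1, T)                    # dates t with all regressors available
    lags = np.column_stack([x[rows - h - k] for k in range(p)])   # x_{t-h}, ..., x_{t-h-p+1}
    W = np.column_stack([np.ones(len(rows)), F_hat[rows - h], lags])
    delta, *_ = np.linalg.lstsq(W, x[rows], rcond=None)
    # Forecast for date T+h uses the factors and the p latest values of x at date T.
    w_last = np.concatenate(([1.0], F_hat[T - 1], x[T - 1 - np.arange(p)]))
    return float(w_last @ delta), delta

# Hypothetical two-factor data set; factors estimated by principal components.
rng = np.random.default_rng(0)
T, N, r = 200, 60, 2
F = np.zeros((T, r))
for t in range(1, T):
    F[t] = np.array([0.7, 0.4]) * F[t - 1] + rng.normal(size=r)
Lam = rng.uniform(0.8, 1.2, size=(N, r)) * rng.choice([-1.0, 1.0], size=(N, r))
X = F @ Lam.T + rng.normal(size=(T, N))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
F_hat = Z @ Vt[:r].T

forecast_h4, delta_h4 = direct_forecast(Z[:, 0], F_hat, h=4, p=2)
print(forecast_h4, delta_h4)
```

Calling the function again with a different `h` re-estimates the coefficients, which reflects the point that each forecasting horizon has its own regression.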

Typically, the coefficients are estimated using data through a specific date, and 
then the coefficients are frozen and used for real-time forecasting. This introduces a 
subtlety for forecasting with DFMs: The final observations on the factors, which are 
used to make real-time forecasts, might not have appeared in the estimation data set. 
As discussed in Appendix 14.5, because the coefficients are estimated using the in- 
sample principal components, the same weights and standardizing means and vari- 
ances must be used to construct the principal components in the out-of-sample 
period as were used in the estimation sample. 
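
The point about frozen weights and standardizing moments can be made concrete. In the sketch below, the means, standard deviations, and principal component weights are computed once on the estimation sample and then reused unchanged for later observations. The function names, the random data, and the split into 100 estimation periods and 20 later periods are assumptions made for illustration.

```python
import numpy as np

def fit_pc(X_est, r):
    """Estimation-sample means, standard deviations, and first r component weights."""
    mean = X_est.mean(axis=0)
    std = X_est.std(axis=0, ddof=1)
    Z = (X_est - mean) / std
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:r].T

def apply_pc(X_new, mean, std, weights):
    """Principal components for new data, using the frozen estimation-sample quantities."""
    return ((X_new - mean) / std) @ weights

# Hypothetical data: 120 periods, 10 series; the first 100 periods form the estimation sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10)) + rng.normal(size=(120, 1))   # one common component
X_est, X_new = X[:100], X[100:]

mean, std, weights = fit_pc(X_est, r=2)
F_in = apply_pc(X_est, mean, std, weights)      # in-sample estimated factors
F_out = apply_pc(X_new, mean, std, weights)     # out-of-sample factors, same weights and moments

# For contrast: re-standardizing with the new sample's own moments gives different values.
F_wrong = apply_pc(X_new, X_new.mean(axis=0), X_new.std(axis=0, ddof=1), weights)
print(F_out[:2], F_wrong[:2])
```

Only `F_out` is comparable to the in-sample factors that were used to estimate the forecasting coefficients.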


Other uses of DFMs. DFMs can be used for purposes other than forecasting. 

One such use is to construct economic indexes. If one has a large number of simi- 
lar series, it can be useful to have a single summary index that captures the common 
comovements. In this case, a model with a single factor can be appropriate. The esti- 
mate of the single factor (the first principal component) then summarizes the 
comovements of all the variables. This approach is commonly used to compute a 
coincident economic index from multiple measures of economic activity. 

Another use of DFMs is to estimate the current value of a variable. This problem 
arises because economic data are typically released with a lag. For example, one might 
be interested in the change of employment in the current month, but those data will 
not be released until next month. The task of “forecasting” current values of economic 
data is called nowcasting. The main technical challenge of nowcasting is that data are 
released over the course of any month, so that the nowcasting model must be able to 
incorporate incoming data as they arrive. The DFM is well suited to doing so, but it 
must be adapted to handle missing observations, and those methods are beyond the 
scope of this book. The Federal Reserve Bank of New York uses a DFM to produce 
nowcasts of GDP, which it updates weekly based on that week’s data.⁴ 


Application to U.S. Macroeconomic Data 


We illustrate the estimation and use of the dynamic factor model using a data set 
comprising 131 quarterly macroeconomic time series for the United States, spanning 
1960:Q1-2017:Q4. The series are summarized in Table 17.2, with additional information 
provided in Appendix 17.1. The variables in the data set include standard 
measures of economic activity, wage and price inflation, interest rates, and data on 
large markets of macroeconomic importance including housing and oil markets. The 
data were transformed to eliminate stochastic trends, typically by transforming to 
growth rates (as for GDP) or first differences (interest rates). These transformed data 
were then standardized by subtracting their sample mean and dividing by their sample 
standard deviation prior to estimation. 

⁴The New York Fed GDP nowcasts are posted at https://www.newyorkfed.org/research/policy/nowcast. 

TABLE 17.2 The Quarterly Macroeconomic Data Set 

Category                                           Number of Series Used for Factor Estimation 
National Income and Product Accounts               13 
Industrial Production                              8 
Employment and Unemployment                        30 
Orders, Inventories, and Sales                     6 
Housing Starts and Permits                         6 
Prices                                             22 
Productivity and Labor Earnings                    5 
Interest Rates                                     10 
Money and Credit                                   
International                                      
Asset Prices, Wealth, Household Balance Sheets     10 
Other                                              2 
Oil Market Variables                               5 
Total                                              131 

In some categories, series are available at multiple levels of aggregation. For 
example, GDP is the sum of consumption, investment, government spending, and 
net exports; thus GDP is perfectly collinear with its components. Similarly, total employment 
is the sum of employment across the sectors of the economy. For the purpose 
of estimating the factors, the aggregate series (GDP, total employment) provide no 
additional information beyond their components, so the aggregate series were 
excluded from the data set. The final column of Table 17.2 lists the number of series 
used to compute the principal component factor estimates. 

Figure 17.4 presents the scree plot of the first 30 principal components of the 131 
series in the data set, over the full 1960-2017 period. Evidently, a large amount of the 
variance of these series is captured by the first few principal components. The first 
principal component explains 20% of the total variance of the series, the second 
principal component explains 9%, and the first four collectively explain 39%. 

The scree plot provides some guidance about the number of factors to include. 
Clearly, the first and second factors are important, and there are also substantial 
drops in the marginal $R^2$ after the third and fourth factors. The decline does not seem 
to stabilize, however, until the tenth factor, so this visual analysis is inconclusive. The 
Bai-Ng information criterion [Equation (17.34)] is minimized using $r = 4$ factors. 
This estimate is within the plausible range from the inspection of the scree plot, so 
we adopt $r = 4$ for the rest of this example. 

FIGURE 17.4 Scree Plot of First 30 Factors for the Macro Data Set, 1960-2017 
The first factor explains 20% of the total variance of the series, and the first four factors collectively explain 39% of the 
total variance of the series. 

Figure 17.5 plots the four-quarter growth rate of GDP, employment, oil prices, 
and returns on the S&P 500 stock index (the four-quarter growth is the percentage 
growth of the series from quarter t to quarter t + 4, computed using the log approximation 
to percentage changes). The figure also plots the common component of each 
of the series, estimated using four factors. Of these series, GDP and employment are 
not in the data set used to estimate the factors because they are aggregates of other 
included series, while the oil prices and stock returns are among the 131 series used 
to estimate the factors. 

The striking conclusion from Figure 17.5 is that the common component, computed
using only the first 4 principal components of the 131 macro variables, captures
a large amount of the variation in these series. Even a large fraction of the
four-quarter returns on the S&P 500 is explained by these 4 factors. This does not imply
that stock returns are predictable; rather, it implies that stock returns are heavily 
influenced by contemporaneous developments in aggregate economic activity. 

We conclude by examining forecasts of GDP growth made using the four estimated
factors and comparing those to the AR and ADL forecasts in Chapter 15. We
consider direct forecasts of cumulative GDP growth at horizons h = 1, 4, and 8,
where growth is measured at an annual rate. For example, at the four-quarter horizon,
the dependent variable is $(400/4)\ln(GDP_t/GDP_{t-4})$, which equals the average of the
quarterly growth in periods t, t − 1, t − 2, and t − 3 at an annual rate. The three
forecasting models examined are direct forecasts of h-period growth corresponding
to an AR(2), an ADL(2, 2) with the term spread, and a four-factor forecast that
includes two lags of GDP growth.

[Figure 17.5: Four-Quarter Growth Rates, Actual and Common Components, 1960-2017.
Panels include (c) Oil prices and (d) Stock prices. The series (black) and estimated
common components (blue) of GDP, employment, oil prices, and returns on the S&P 500
are based on a four-factor DFM, estimated using the 131-series macroeconomic data set,
1960-2017.]


Table 17.3 reports the performance of the forecasts as measured by the pseudo
out-of-sample root mean square forecast error, $\mathrm{RMSFE}_{\mathrm{POOS}}$ [Equation (15.22)].
The first column lists the regressors in the direct forecasting regressions. Following
Section 15.8, the in-sample period starts in 1981:Q1 and ends h periods prior to
2002:Q4; the pseudo out-of-sample period is 2002:Q4-2017:Q4.
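The mechanics of a pseudo out-of-sample comparison of direct forecasts can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce Table 17.3: the data are simulated from an AR(1) for quarterly growth, the function names `cumulative_growth` and `poos_rmsfe` are hypothetical, and the regression is re-estimated at every forecast origin on an expanding window. The dependent variable is h-quarter cumulative growth at an annual rate, $(400/h)\ln(GDP_{t+h}/GDP_t)$, regressed on a constant and two lags of quarterly growth (a direct AR(2) forecast).

```python
import numpy as np

def cumulative_growth(log_gdp, h):
    """(400/h) * [ln(GDP_{t+h}) - ln(GDP_t)] for every t with t + h in the sample."""
    return (400.0 / h) * (log_gdp[h:] - log_gdp[:-h])

def poos_rmsfe(log_gdp, h, first_origin):
    """Pseudo out-of-sample RMSFE of a direct AR(2) forecast at horizon h.

    At each forecast origin s, the regression is estimated using only
    observations whose dependent variable is already known at date s.
    """
    g = 400.0 * np.diff(log_gdp)       # g[t-1] is growth from quarter t-1 to t
    y = cumulative_growth(log_gdp, h)  # y[t] is growth from quarter t to t+h
    errors = []
    for s in range(first_origin, len(log_gdp) - h):
        t_in = np.arange(2, s - h + 1)  # in-sample dates: t + h <= s
        X = np.column_stack([np.ones(len(t_in)), g[t_in - 1], g[t_in - 2]])
        beta, *_ = np.linalg.lstsq(X, y[t_in], rcond=None)
        x_s = np.array([1.0, g[s - 1], g[s - 2]])
        errors.append(y[s] - x_s @ beta)
    return float(np.sqrt(np.mean(np.square(errors))))

# Simulated quarterly growth (an assumption for illustration), in percent at an annual rate.
rng = np.random.default_rng(0)
n = 232
growth = np.empty(n - 1)
growth[0] = 3.0
for t in range(1, n - 1):
    growth[t] = 3.0 + 0.3 * (growth[t - 1] - 3.0) + 3.0 * rng.standard_normal()
log_gdp = np.concatenate([[0.0], np.cumsum(growth / 400.0)])

rmsfe = {h: poos_rmsfe(log_gdp, h, first_origin=100) for h in (1, 4, 8)}
print(rmsfe)
```

Because the h-quarter dependent variable is an average of h quarterly growth rates, transitory quarter-to-quarter noise is averaged out at longer horizons, which is one reason the forecast error tends to shrink as h increases.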

Three aspects of these results are noteworthy. First, the $\mathrm{RMSFE}_{\mathrm{POOS}}$ decreases as
the horizon lengthens. One reason for this improvement at longer horizons is that 
quarterly GDP has a large amount of transitory measurement error, which is smoothed 
over (averaged out) by considering growth rates over one or two years. This quarterly 
“noise” is evident in the time series plot of quarterly GDP growth in Figure 15.1b. 

Second, at all horizons the forecasts that use the term spread do worse in the
out-of-sample period than the direct AR(2) forecasts. This would appear to contradict
the improvement in in-sample fit provided by the term spread: The F-statistic
testing whether the coefficients on $TSpread_{t-1}$ and $TSpread_{t-2}$ are 0 in the h = 1
estimation sample (1981:Q1-2002:Q3) is statistically significant at the 1% level. Evidently,
the coefficients on the lagged term spread estimated in the in-sample period
do not capture the relation between the term spread and GDP in the pseudo out-of-sample
period, an indication that this relation is nonstationary. In real-world terms,
one important difference between the in- and out-of-sample periods is that, starting
in 2008, the Federal Reserve Board introduced new monetary policy tools to manage
long-term as well as short-term rates, thereby changing the relation between the term
spread and economic activity.

Table 17.3: Comparison of Direct Forecasts of Cumulative GDP Growth at an Annual Rate:
Lagged GDP, Term Spread, and Principal Components, 2002:Q4-2017:Q4

                                                                                      RMSFE_POOS
Predictors                                                                            h = 1   h = 4   h = 8
GDPGR_{t-h}, GDPGR_{t-h-1}                                                            2.25    1.91    1.74
GDPGR_{t-h}, GDPGR_{t-h-1}, TSpread_{t-h}, TSpread_{t-h-1}                            2.29    1.94    1.77
GDPGR_{t-h}, GDPGR_{t-h-1}, \hat{F}_{1,t-h}, \hat{F}_{2,t-h}, \hat{F}_{3,t-h}, \hat{F}_{4,t-h}    2.14    1.40    1.48

Entries are root mean square forecast errors, estimated by pseudo out-of-sample forecasts for the
forecast period 2002:Q4-2017:Q4 [Equation (15.22)]. The forecasting models were estimated using data
from 1981:Q1 through h periods before 2002:Q4, where h is the forecast horizon. The dependent variable
is the h-quarter cumulative growth in GDP at an annual rate, using log points, that is,
$(400/h)\ln(GDP_t/GDP_{t-h})$. The regressors are given in the first column, where $\hat{F}_1$ denotes the
first factor estimated by the first principal component in the estimation sample and so on. All
regressions include an intercept.

Third, the factor forecasts improve upon the AR and ADL forecasts at all horizons.
Closer inspection of the forecasts reveals that this improvement is due to much better
performance of the factor forecasts during the recession and early recovery following the
financial crisis in the fall of 2008. During this recession, the strong negative comovements
across many macro variables pointed toward a deep recession, a feature missed by the
AR forecast. In contrast, during the relatively quiescent periods of 2005 and after 2013,
the AR(2) direct forecast actually performs slightly better than the factor forecast.


Nobel Laureates in Time Series Econometrics 


In 2003, Robert Engle and Clive Granger won the Nobel Prize in Economics for fundamental
theoretical research in time series econometrics. Engle’s work was motivated by the volatility
clustering evident in plots like Figure 17.2. Engle wondered whether series like these could be
stationary and whether econometric models could be developed to explain and predict their
time-varying volatility. Engle’s answer was to develop the autoregressive conditional
heteroskedasticity (ARCH) model, described in Section 17.5. The ARCH model and its extensions
proved especially useful for modeling the volatility of asset returns, and the resulting volatility
forecasts are used to price financial derivatives and to assess changes over time in the risk of
holding financial assets. Today, measures and forecasts of volatility are a core component of
financial econometrics, and the ARCH model and its descendants are the workhorse tools for
modeling volatility.

Granger’s work focused on how to handle stochastic trends in economic time series data. From
his earlier work, he knew that two unrelated series with stochastic trends could, by the usual
statistical measures of t-statistics and regression $R^2$s, falsely appear to be meaningfully
related; this is the “spurious regression” problem exemplified by the regressions in
Equations (14.28) and (14.29). But are all regressions involving stochastic trending variables
spurious? Granger discovered that when variables shared common trends (in his terminology,
were “co-integrated”), meaningful relationships could be uncovered by regression analysis using
a vector error correction model. The methods of cointegration analysis are now a staple in modern
macroeconometrics.

In 2011, Thomas Sargent and Christopher Sims won the Nobel Prize for their empirical research on
cause and effect in the macroeconomy. Sargent was recognized for developing models that featured the
important role that expectations about the future play in disentangling cause and effect. Sims was
recognized for developing structural VAR (SVAR) models. Sims’s key insight concerned the forecast
errors in a VAR model, that is, the $u_t$ errors in Equations (17.1) and (17.2). These errors, he
realized, arose because of unforeseen “shocks” that buffeted the macroeconomy, and in many cases,
these shocks had well-defined sources like the Organization of Petroleum Exporting Countries (oil
price shocks), the Fed (interest rate shocks), or Congress (tax shocks). By disentangling the various
sources of shocks that comprise the VAR errors, Sims was able to estimate the dynamic causal effect
of these shocks on the variables appearing in the VAR. This disentangling of shocks is never without
controversy, but SVARs are now a standard tool for estimating dynamic causal effects in
macroeconomics.

In 2013, Eugene Fama, Lars Peter Hansen, and 
Robert Shiller won the Nobel Prize for their empiri- 
cal analysis of asset prices. The work in the box 
“Can You Beat the Market?” in Chapter 15 and 
the box “NEWS FLASH: Commodity Traders Send 
Shivers Through Disney World” in Chapter 16 was 
motivated in part by the “efficient markets” (unpre- 
dictability) work of Fama and the “irrational exu- 
berance” (unexplained volatility) work of Shiller. 
Hansen was honored for developing generalized 
method of moments (GMM) methods to investigate 
whether asset returns are consistent with expected 
utility theory. Microeconomics says that investors 
should equate the marginal cost of an investment 
(today’s foregone utility from investing rather than 
consuming) with its marginal benefit (tomorrow’s 
boost in utility from consumption financed by the 
investment’s return). But a simple test of this propo- 
sition is complicated because marginal utility is dif- 
ficult to measure, asset returns are uncertain, and the 
argument should hold across all asset returns. Hansen 
developed GMM methods to test asset-pricing mod- 
els. As it turned out, Hansen’s GMM methods had 
applications well beyond finance and are now widely 
used in econometrics. Section 19.7 introduces GMM. 

For more information on these and other Nobel laureates in economics, visit the Nobel Foundation
website, http://www.nobel.se/economics.

17.7 Conclusion

This part of the text has covered some of the most frequently used tools and concepts
of time series regression. Many other tools for analyzing economic time series have
been developed for specific applications. If you are interested in learning more about
economic forecasting, see the introductory textbooks by Diebold (2017) and Enders
(2009). For an advanced treatment of econometrics with time series data, see
Hamilton (1994) and Hayashi (2000). For an advanced treatment of vector autoregressions,
see Kilian and Lütkepohl (2017), and for more on dynamic factor models,
see Stock and Watson (2016).


Summary 


1. Vector autoregressions model k time series variables, with each depending on
   its own lags and the lags of the k − 1 other series. The forecasts of each of the
   time series produced by a VAR are mutually consistent in the sense that they
   are based on the same information.

2. Forecasts two or more periods ahead can be computed either by iterating forward
   a one-step ahead model (an AR or a VAR) or by estimating a multiperiod
   ahead regression.

3. Two series that share a common stochastic trend are cointegrated; that is, $Y_t$
   and $X_t$ are cointegrated if $Y_t$ and $X_t$ are $I(1)$ but $Y_t - \theta X_t$ is $I(0)$. If $Y_t$ and $X_t$
   are cointegrated, the error correction term $Y_t - \theta X_t$ can help predict $\Delta Y_t$ and/or
   $\Delta X_t$. A vector error correction model is a VAR model of $\Delta Y_t$ and $\Delta X_t$ augmented
   to include the lagged error correction term.

4. Volatility clustering, in which the variance of a series is high in some periods
   and low in others, is common in economic time series, especially financial
   time series. Realized volatility is an estimate of time-varying volatility using a
   rolling root mean square estimator.

5. The ARCH model of volatility clustering expresses the conditional variance
   of the regression error as a function of recent squared regression errors. The
   GARCH model augments the ARCH model to include lagged conditional
   variances as well. Realized volatility and ARCH/GARCH models produce
   forecast intervals with widths that depend on the volatility of the most recent
   regression residuals.

6. The comovements of a large number of time series sometimes can be summarized
   by the first few principal components, which in turn can be used for
   forecasting. The framework for doing so is the dynamic factor model, which
   posits that a small number of unobserved factors drive the comovements of a
   large number of macroeconomic variables.

Review the Concepts


17.1 A macroeconomist wants to construct forecasts for the following macroeconomic
variables: GDP, consumption, investment, government purchases,
exports, imports, short-term interest rates, long-term interest rates, and the
rate of price inflation. He has quarterly time series for each of these variables
from 1970 to 2017. Should he estimate a VAR for these variables and use this
for forecasting? Why or why not? Can you suggest an alternative approach?

17.2 Suppose that $Y_t$ follows a stationary AR(1) model with $\beta_0 = 0$ and $\beta_1 = 0.5$. If
$Y_t = 10$, what is your forecast of $Y_{t+2}$ (that is, what is $Y_{t+2|t}$)? What is $Y_{t+h|t}$ for
h = 20? Does this forecast for h = 20 seem reasonable to you?

17.3 A version of the permanent income theory of consumption implies that the
logarithm of real GDP (Y) and the logarithm of real consumption (C) are cointegrated
with a cointegrating coefficient equal to 1. Explain how you would investigate
this implication by (a) plotting the data and (b) using a statistical test.

17.4 What is volatility clustering? Explain two models that are used to describe
data processes with volatility clustering.

17.5 What is a unit root? How does a researcher test for the presence of a unit root
in the data?


Exercises 


17.1 Suppose that $Y_t$ follows a stationary AR(1) model, $Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$.

a. Show that the h-period ahead forecast of $Y_t$ is given by
$Y_{t+h|t} = \mu_Y + \beta_1^h (Y_t - \mu_Y)$, where $\mu_Y = \beta_0/(1 - \beta_1)$.

b. Suppose that $X_t$ is related to $Y_t$ by $X_t = \sum_{i=0}^{\infty} \delta^i Y_{t+i|t}$, where $|\delta| < 1$.
Show that $X_t = [\mu_Y/(1 - \delta)] + [(Y_t - \mu_Y)/(1 - \beta_1 \delta)]$.

17.2 One version of the expectations theory of the term structure of interest rates
holds that a long-term rate equals the average of the expected values of short-term
interest rates into the future plus a term premium that is $I(0)$. Specifically,
let $Rk_t$ denote a k-period interest rate, let $R1_t$ denote a one-period interest
rate, and let $e_t$ denote an $I(0)$ term premium. Then $Rk_t = \frac{1}{k}\sum_{i=0}^{k-1} R1_{t+i|t} + e_t$,
where $R1_{t+i|t}$ is the forecast made at date t of the value of R1 at date t + i.
Suppose that $R1_t$ follows a random walk so that $R1_t = R1_{t-1} + u_t$.

a. Show that $Rk_t = R1_t + e_t$.

b. Show that $Rk_t$ and $R1_t$ are cointegrated. What is the cointegrating
coefficient?

c. Now suppose that $\Delta R1_t = 0.5\,\Delta R1_{t-1} + u_t$. How does your answer
to (b) change?

d. Now suppose that $R1_t = 0.5\,R1_{t-1} + u_t$. How does your answer to
(b) change?

17.3 Suppose that $E(u_t \mid u_{t-1}, u_{t-2}, \ldots) = 0$ and $u_t$ follows the ARCH process,
$\sigma_t^2 = 1.0 + 0.5\,u_{t-1}^2$.

a. Let $E(u_t^2) = \mathrm{var}(u_t)$ be the unconditional variance of $u_t$. Show
that $\mathrm{var}(u_t) = 2$. (Hint: Use the law of iterated expectations,
$E(u_t^2) = E[E(u_t^2 \mid u_{t-1})]$.)

b. Suppose that the distribution of $u_t$ conditional on lagged values of $u_t$ is
$N(0, \sigma_t^2)$. If $u_{t-1} = 0.2$, what is $\Pr(-3 \le u_t \le 3)$? If $u_{t-1} = 2.0$, what is
$\Pr(-3 \le u_t \le 3)$?

17.4 Suppose that $Y_t$ follows the AR(p) model $Y_t = \beta_0 + \beta_1 Y_{t-1} + \cdots + \beta_p Y_{t-p} + u_t$,
where $E(u_t \mid Y_{t-1}, Y_{t-2}, \ldots) = 0$. Let $Y_{t+h|t} = E(Y_{t+h} \mid Y_t, Y_{t-1}, \ldots)$. Show
that $Y_{t+h|t} = \beta_0 + \beta_1 Y_{t-1+h|t} + \cdots + \beta_p Y_{t-p+h|t}$ for h > p.

17.5 Verify Equation (17.20). [Hint: Use $\sum_{t=1}^{T} Y_t^2 = \sum_{t=1}^{T} (Y_{t-1} + \Delta Y_t)^2$ to show
that $\sum_{t=1}^{T} Y_t^2 = \sum_{t=1}^{T} Y_{t-1}^2 + 2\sum_{t=1}^{T} Y_{t-1}\Delta Y_t + \sum_{t=1}^{T} \Delta Y_t^2$, and solve for
$\sum_{t=1}^{T} Y_{t-1}\Delta Y_t$.]

17.6 A regression of $Y_t$ onto current, past, and future values of $X_t$ yields
$Y_t = 2.0 + 1.5X_{t+1} + 0.9X_t - 0.3X_{t-1} + u_t$.

a. Rearrange the regression so that it has the form shown in Equation (17.25).
What are the values of $\theta$, $\delta_{-1}$, $\delta_0$, and $\delta_1$?

b. i. Suppose that $X_t$ is $I(0)$ and $u_t$ is $I(0)$. Are Y and X cointegrated?

ii. Suppose that $X_t$ is $I(1)$ and $Y_t$ is $I(1)$. Are Y and X cointegrated?

iii. Suppose that $X_t$ is $I(1)$ and $u_t$ is $I(0)$. Are Y and X cointegrated?

17.7 Suppose that $\Delta Y_t = u_t$, where $u_t$ is i.i.d. $N(0, 1)$, and consider the regression
$Y_t = \beta X_t + error$, where $X_t = \Delta Y_{t+1}$ and error is the regression error. Show
that $\hat{\beta} \xrightarrow{d} \frac{1}{2}(\chi_1^2 - 1)$. [Hint: Analyze the numerator of $\hat{\beta}$ using analysis
like that in Equation (17.21). Analyze the denominator using the law of large
numbers.]

17.8 Consider the following two-variable VAR model with one lag and no intercept:

$$Y_t = \beta_{11} Y_{t-1} + \gamma_{11} X_{t-1} + u_{1t}$$
$$X_t = \beta_{21} Y_{t-1} + \gamma_{21} X_{t-1} + u_{2t}$$

a. Show that the iterated two-period ahead forecast for Y can be written as
$Y_{t|t-2} = \delta_1 Y_{t-2} + \delta_2 X_{t-2}$, and derive values for $\delta_1$ and $\delta_2$ in terms of the
coefficients in the VAR.

b. In light of your answer to (a), do iterated multi-period forecasts differ
from direct multi-period forecasts? Explain.

17.9 a. Suppose that $E(u_t \mid u_{t-1}, u_{t-2}, \ldots) = 0$, that $\mathrm{var}(u_t \mid u_{t-1}, u_{t-2}, \ldots)$ follows
the ARCH(1) model $\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2$, and that the process for $u_t$ is
stationary. Show that $\mathrm{var}(u_t) = \alpha_0/(1 - \alpha_1)$. (Hint: Use the law of iterated
expectations, $E(u_t^2) = E[E(u_t^2 \mid u_{t-1})]$.)

b. Extend the result in (a) to the ARCH(p) model.

c. Show that $\sum_{i=1}^{p} \alpha_i < 1$ for a stationary ARCH(p) model.

d. Extend the result in (a) to the GARCH(1, 1) model.

e. Show that $\alpha_1 + \phi_1 < 1$ for a stationary GARCH(1, 1) model.

17.10 Consider the cointegrated model $Y_t = \theta X_t + v_{1t}$ and $X_t = X_{t-1} + v_{2t}$, where $v_{1t}$
and $v_{2t}$ are mean 0 serially uncorrelated random variables with $E(v_{1t}v_{2j}) = 0$
for all t and j. Derive the vector error correction model [Equations (17.22) and
(17.23)] for X and Y.

E17.1


This exercise is an extension of Empirical Exercise 14.1. On the text website,
http://www.pearsonglobaleditions.com, you will find the data file USMacro_Quarterly,
which contains quarterly data on several macroeconomic series for the United
States; the data are described in the file USMacro_Description. Compute
inflation, Infl, using the price index for personal consumption expenditures.
For all regressions, use the sample period 1963:Q1-2017:Q4 (where data before
1963 may be used as initial values for lags in regressions).

a. Using the data on inflation through 2017:Q4 and an estimated AR(2) model:

i. Forecast $\Delta Infl_{2018:Q1}$, the change in inflation from 2017:Q4 to 2018:Q1.

ii. Forecast $\Delta Infl_{2018:Q2}$, the change in inflation from 2018:Q1 to
2018:Q2. (Use an iterated forecast.)

iii. Forecast $Infl_{2018:Q2} - Infl_{2017:Q4}$, the change in inflation from 2017:Q4
to 2018:Q2.

iv. Forecast $Infl_{2018:Q2}$, the rate of inflation in 2018:Q2.

b. Repeat (a) using the direct forecasting method.

E17.2 On the text website, http://www.pearsonglobaleditions.com, you will find
the data file USMacro_Quarterly, which contains quarterly data on real
GDP, measured in 2009 dollars. Compute $GDPGR_t = 400 \times [\ln(GDP_t) - \ln(GDP_{t-1})]$,
the GDP growth rate.

a. Using data on $GDPGR_t$ from 1960:Q1 to 2017:Q4, estimate an AR(2)
model with GARCH(1, 1) errors.

b. Plot the residuals from the AR(2) model along with $\pm\hat{\sigma}_t$ bands as in
Figure 17.3.

c. Some macroeconomists have claimed that there was a sharp drop in the
variability of the growth rate of GDP around 1983, which they call the Great
Moderation. Is this Great Moderation evident in your plot for (b)? Explain.


The Quarterly U.S. Macro Data Set 


The variables in the quarterly U.S. data set were obtained from the FRED online database of
macroeconomic time series maintained by the Federal Reserve Bank of St. Louis. The categories
of variables are listed in Table 17.2. The National Income and Product Account variables
included in the data set for estimating the factors are three measures of personal consumption
expenditures (durable goods, nondurable goods, and services); four measures of private investment
(nonresidential structures, nonresidential intellectual property, nonresidential fixed equipment,
and residential structures), federal government expenditures, federal government receipts,
state and local government consumption, exports, and imports (all real). Stochastic trends were
eliminated by (in most cases) computing quarterly growth rates or first differences. For details
and for the full list of series, see the online documentation supporting this text.


18 The Theory of Linear Regression with One Regressor


Why should an applied econometrician bother learning any econometric theory?
There are several reasons. Learning econometric theory turns your statistical
software from a “black box” into a flexible tool kit from which you are able to select 
the right tool for the job at hand. Understanding econometric theory helps you 
appreciate why these tools work and what assumptions are required for each tool to 
work properly. Perhaps most importantly, knowing econometric theory helps you 
recognize when a tool will not work well in an application and when you should look 
for a different econometric approach. 

This chapter provides an introduction to the econometric theory of linear 
regression with a single regressor. This introduction is intended to supplement—not 
replace—the material in Chapters 4 and 5, which should be read first. 

This chapter extends Chapters 4 and 5 in two ways. 

First, it provides a mathematical treatment of the sampling distribution of the 
ordinary least squares (OLS) estimator and t-statistic, both in large samples under 
the three least squares assumptions for causal inference of Key Concept 4.3 and in 
finite samples under the two additional assumptions of homoskedasticity and 
normal errors. These five extended least squares assumptions are laid out in 
Section 18.1. Sections 18.2 and 18.3, augmented by Appendix 18.2, mathematically 
develop the large-sample normal distributions of the OLS estimator and t-statistic 
under the first three assumptions (the least squares assumptions for causal inference 
of Key Concept 4.3). Section 18.4 derives the exact distributions of the OLS estimator 
and t-statistic under the two additional assumptions of homoskedasticity and nor- 
mally distributed errors. 

Second, this chapter extends Chapters 4 and 5 by providing an alternative 
method for handling heteroskedasticity. The approach of Chapters 4 and 5 is to 
use heteroskedasticity-robust standard errors to ensure that statistical inference is 
valid even if the errors are heteroskedastic. This method comes with a cost, 
however: If the errors are heteroskedastic, then in theory a more efficient estimator 
than OLS is available. This estimator, called weighted least squares, is presented in 
Section 18.5. Weighted least squares requires a great deal of prior knowledge 
about the precise nature of the heteroskedasticity—that is, about the conditional 
variance of u given X. When such knowledge is available, weighted least squares 
improves upon OLS. In most applications, however, such knowledge is unavailable; 
in those cases, using OLS with heteroskedasticity-robust standard errors is the 
preferred method. 
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18.1 


The Extended Least Squares Assumptions 
and the OLS Estimator 


This section introduces a set of assumptions that extend and strengthen the three least 
squares assumptions for causal inference of Chapter 4. These stronger assumptions are 
used in subsequent sections to derive stronger theoretical results about the OLS 
estimator than are possible under the weaker (but more realistic) assumptions of 
Chapter 4. 


The Extended Least Squares Assumptions 


Extended least squares Assumptions 1, 2, and 3. The first three extended least squares
assumptions are the three assumptions given in Key Concept 4.3: The conditional mean
of $u_i$ given $X_i$ is 0; $(X_i, Y_i)$, $i = 1, \ldots, n$, are independent and identically distributed
(i.i.d.) draws from their joint distribution; and $X_i$ and $u_i$ have nonzero finite fourth
moments.

Under these three assumptions, the OLS estimator is unbiased, is consistent, and 
has a normal sampling distribution in large samples. If these three assumptions hold, 
then the methods for inference introduced in Chapter 4 (hypothesis testing using
the t-statistic and construction of 95% confidence intervals as ±1.96 standard
errors) are justified when the sample size is large. To develop a theory of efficient
estimation using OLS or to characterize the exact sampling distribution of the OLS 
estimator, however, requires stronger assumptions. 


Extended least squares assumption 4. The fourth extended least squares assumption
is that $u_i$ is homoskedastic; that is, $\mathrm{var}(u_i \mid X_i) = \sigma_u^2$, where $\sigma_u^2$ is a constant. As seen
in Section 5.5, if this additional assumption holds, then the OLS estimator is efficient
among all linear estimators that are unbiased, conditional on $X_1, \ldots, X_n$.


Extended least squares assumption 5. The fifth extended least squares assumption
is that the conditional distribution of $u_i$ given $X_i$ is normal.

Under extended least squares assumptions 1, 2, 4, and 5, $u_i$ is i.i.d. $N(0, \sigma_u^2)$, and
$u_i$ and $X_i$ are independently distributed. To see this, note that the fifth extended least
squares assumption states that the conditional distribution of $u_i \mid X_i$ is $N(0, \mathrm{var}(u_i \mid X_i))$,
where the distribution has mean 0 by the first extended least squares assumption. By
the fourth extended least squares assumption, however, $\mathrm{var}(u_i \mid X_i) = \sigma_u^2$, so the conditional
distribution of $u_i \mid X_i$ is $N(0, \sigma_u^2)$. Because this conditional distribution does
not depend on $X_i$, $u_i$ and $X_i$ are independently distributed. By the second extended
least squares assumption, $u_i$ is distributed independently of $u_j$ for all $j \ne i$. It follows
that, under extended least squares assumptions 1, 2, 4, and 5, $u_i$ and $X_i$ are independently
distributed and $u_i$ is i.i.d. $N(0, \sigma_u^2)$.

Key Concept 18.1: The Extended Least Squares Assumptions for Regression with a Single Regressor

The linear regression model with a single regressor is

$$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \ldots, n, \tag{18.1}$$

where $\beta_1$ is the causal effect on Y of X.

The extended least squares assumptions are

1. $E(u_i \mid X_i) = 0$ (conditional mean 0);

2. $(X_i, Y_i)$, $i = 1, \ldots, n$, are independent and identically distributed (i.i.d.)
   draws from their joint distribution;

3. $X_i$ and $u_i$ have nonzero finite fourth moments;

4. $\mathrm{var}(u_i \mid X_i) = \sigma_u^2$ (homoskedasticity); and

5. The conditional distribution of $u_i$ given $X_i$ is normal (normal errors).


It is shown in Section 18.4 that, if all five extended least squares assumptions 
hold, the OLS estimator has an exact normal sampling distribution, and the 
homoskedasticity-only t-statistic has an exact Student t distribution.

The fourth and fifth extended least squares assumptions are much more restric- 
tive than the first three. Although it might be reasonable to assume that the first three 
assumptions hold in an application, the final two assumptions are less realistic. Even 
though these final two assumptions might not hold in practice, they are of theoretical 
interest because if one or both of them hold, then the OLS estimator has additional 
properties beyond those discussed in Chapters 4 and 5. Thus we can enhance our 
understanding of the OLS estimator and the theory of estimation in the linear regres- 
sion model by exploring estimation under these stronger assumptions. 

The five extended least squares assumptions for the single-regressor model are 
summarized in Key Concept 18.1. 


The OLS Estimator 


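For easy reference, we restate the OLS estimators of $\beta_0$ and $\beta_1$ here:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \tag{18.2}$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}. \tag{18.3}$$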


Equations (18.2) and (18.3) are derived in Appendix 4.2. 
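As a quick check on Equations (18.2) and (18.3), the two formulas can be computed directly. The sketch below is only illustrative: the function name `ols_single_regressor` and the small data set are assumptions made for the example, and the result is compared with NumPy's `polyfit` as an independent least squares fit.

```python
import numpy as np

def ols_single_regressor(x, y):
    """OLS slope and intercept computed from Equations (18.2) and (18.3)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# Illustrative data (an assumption): a line with intercept 2 and slope 0.5 plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=200)

b0, b1 = ols_single_regressor(x, y)
slope_check, intercept_check = np.polyfit(x, y, deg=1)
print(b0, b1)
```

The two approaches agree up to floating-point error, since both minimize the same sum of squared residuals.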
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18.2 


Fundamentals of Asymptotic 
Distribution Theory 


Asymptotic distribution theory is the theory of the distribution of statistics—estima- 
tors, test statistics, and confidence intervals—when the sample size is large. Formally, 
this theory involves characterizing the behavior of the sampling distribution of a sta- 
tistic along a sequence of ever-larger samples. The theory is asymptotic in the sense that 
it characterizes the behavior of the statistic in the limit as $n \to \infty$.

Even though sample sizes are, of course, never infinite, asymptotic distribution 
theory plays a central role in econometrics and statistics for two reasons. First, if the 
number of observations used in an empirical application is large, then the asymptotic 
limit can provide a high-quality approximation to the finite sample distribution. Sec- 
ond, asymptotic sampling distributions typically are much simpler, and thus easier to 
use in practice, than exact finite-sample distributions. Taken together, these two reasons 
mean that reliable and straightforward methods for statistical inference (tests using
t-statistics and 95% confidence intervals calculated as ±1.96 standard errors) can be
based on approximate sampling distributions derived from asymptotic theory. 

The two cornerstones of asymptotic distribution theory are the law of large num- 
bers and the central limit theorem, both introduced in Section 2.6. We begin this section 
by continuing the discussion of the law of large numbers and the central limit theorem, 
including a proof of the law of large numbers. We then introduce two more tools, 
Slutsky’s theorem and the continuous mapping theorem, that extend the usefulness of 
the law of large numbers and the central limit theorem. As an illustration, these tools 
are then used to prove that the distribution of the t-statistic based on $\bar{Y}$ testing the
hypothesis $E(Y) = \mu_0$ has a standard normal distribution under the null hypothesis.


Convergence in Probability and the Law of Large Numbers 


The concepts of convergence in probability and the law of large numbers were intro- 
duced in Section 2.6. Here we provide a precise mathematical definition of conver- 
gence in probability, followed by a statement and proof of the law of large numbers. 


Consistency and convergence in probability. Let $S_1, S_2, \ldots, S_n, \ldots$ be a sequence
of random variables. For example, $S_n$ could be the sample average $\bar{Y}$ of a sample of
n observations of the random variable Y. The sequence of random variables $\{S_n\}$ is
said to converge in probability to a limit, $\mu$ (that is, $S_n \xrightarrow{p} \mu$), if the probability
that $S_n$ is within $\pm\delta$ of $\mu$ tends to 1 as $n \to \infty$, as long as the constant $\delta$ is positive.
That is,

$$S_n \xrightarrow{p} \mu \text{ if and only if } \Pr(|S_n - \mu| \ge \delta) \to 0 \tag{18.4}$$

as $n \to \infty$ for every $\delta > 0$. If $S_n \xrightarrow{p} \mu$, then $S_n$ is said to be a consistent estimator
of $\mu$.

The law of large numbers. The law of large numbers says that, under certain conditions
on $Y_1, \ldots, Y_n$, the sample average $\bar{Y}$ converges in probability to the population
mean. Probability theorists have developed many versions of the law of large
numbers, corresponding to various conditions on $Y_1, \ldots, Y_n$. The version of the
law of large numbers used in this text is that $Y_1, \ldots, Y_n$ are i.i.d. draws from a
distribution with finite variance. This law of large numbers (also stated in
Key Concept 2.6) is

$$\text{if } Y_1, \ldots, Y_n \text{ are i.i.d., } E(Y_i) = \mu_Y, \text{ and } \mathrm{var}(Y_i) < \infty, \text{ then } \bar{Y} \xrightarrow{p} \mu_Y. \tag{18.5}$$


The idea of the law of large numbers can be seen in Figure 2.8: As the sample size
increases, the sampling distribution of $\bar{Y}$ concentrates around the population mean,
$\mu_Y$. One feature of the sampling distribution is that the variance of $\bar{Y}$ decreases as
the sample size increases; another feature is that the probability that $\bar{Y}$ falls outside
$\pm\delta$ of $\mu_Y$ vanishes as n increases. These two features of the sampling distribution are,
in fact, linked, and the proof of the law of large numbers exploits this link.


Proof of the law of large numbers. The link between the variance of $\bar{Y}$ and the
probability that $\bar{Y}$ is within $\pm\delta$ of $\mu_Y$ is provided by Chebychev’s inequality, which is
stated and proven in Appendix 18.2 [see Equation (18.42)]. Written in terms of $\bar{Y}$,
Chebychev’s inequality is

$$\Pr(|\bar{Y} - \mu_Y| \ge \delta) \le \frac{\mathrm{var}(\bar{Y})}{\delta^2} \tag{18.6}$$

for any positive constant $\delta$. Because $Y_1, \ldots, Y_n$ are i.i.d. with variance $\sigma_Y^2$,
$\mathrm{var}(\bar{Y}) = \sigma_Y^2/n$; thus, for any $\delta > 0$, $\mathrm{var}(\bar{Y})/\delta^2 = \sigma_Y^2/(\delta^2 n) \to 0$. It follows from
Equation (18.6) that $\Pr(|\bar{Y} - \mu_Y| \ge \delta) \to 0$ for every $\delta > 0$, proving the law of
large numbers. 
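
The argument can be checked numerically. The following is a minimal sketch, not part of the text's proof: it assumes i.i.d. draws from the uniform distribution on (0, 1), so that $\mu_Y = 1/2$ and $\sigma_Y^2 = 1/12$, and the function names, the value of $\delta$, and the sample sizes are illustrative choices. It compares a simulated estimate of $\Pr(|\bar{Y} - \mu_Y| \ge \delta)$ with the right-hand side of Equation (18.6).

```python
import random

def tail_probability(n, delta, reps=2000, seed=0):
    """Monte Carlo estimate of Pr(|Ybar - mu_Y| >= delta) for i.i.d. U(0, 1) draws."""
    rng = random.Random(seed)
    mu_y = 0.5  # population mean of U(0, 1)
    exceed = 0
    for _ in range(reps):
        ybar = sum(rng.random() for _ in range(n)) / n
        if abs(ybar - mu_y) >= delta:
            exceed += 1
    return exceed / reps

def chebychev_bound(n, delta, var_y=1.0 / 12.0):
    """Right-hand side of Equation (18.6): var(Ybar) / delta^2 = var_y / (n * delta^2)."""
    return var_y / (n * delta ** 2)

delta = 0.05  # illustrative tolerance
results = [(n, tail_probability(n, delta), chebychev_bound(n, delta))
           for n in (10, 100, 1000)]
for n, p_hat, bound in results:
    print(f"n={n:5d}  simulated tail={p_hat:.4f}  Chebychev bound={bound:.4f}")
```

The bound is loose (for small $n$ it can exceed 1), but both it and the simulated tail probability shrink toward 0 as $n$ grows, which is all the proof requires.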


Some examples. Consistency is a fundamental concept in asymptotic distribution
theory, so we present some examples of consistent and inconsistent estimators of
the population mean, $\mu_Y$. Suppose that $Y_i$, $i = 1, \ldots, n$, are i.i.d. $N(0, \sigma_Y^2)$, where
$0 < \sigma_Y^2 < \infty$. Consider the following three estimators of $\mu_Y$: (1) $m_a = Y_1$;
(2) $m_b = \left(\frac{1 - a^n}{1 - a}\right)^{-1} \sum_{i=1}^{n} a^{i-1} Y_i$, where $0 < a < 1$; and (3) $m_c = \bar{Y} + 1/n$. Are these
estimators consistent?

The first estimator, $m_a$, is just the first observation, so $E(m_a) = E(Y_1) = \mu_Y$
and $m_a$ is unbiased. However, $m_a$ is not consistent: $\Pr(|m_a - \mu_Y| \ge \delta) =
\Pr(|Y_1 - \mu_Y| \ge \delta)$, which must be positive for sufficiently small $\delta$ (because $\sigma_Y^2 > 0$),
so $\Pr(|m_a - \mu_Y| \ge \delta)$ does not tend to 0 as $n \to \infty$ and $m_a$ is not consistent. This
inconsistency should not be surprising: Because $m_a$ uses the information in only one
observation, its distribution cannot concentrate around $\mu_Y$ as the sample size
increases.

The second estimator, $m_b$, is unbiased but is not consistent. It is unbiased because

$$E(m_b) = E\left[\left(\frac{1 - a^n}{1 - a}\right)^{-1} \sum_{i=1}^{n} a^{i-1} Y_i\right] = \left(\frac{1 - a^n}{1 - a}\right)^{-1} \sum_{i=1}^{n} a^{i-1} \mu_Y = \mu_Y,$$

since $\sum_{i=1}^{n} a^{i-1} = (1 - a^n) \sum_{i=0}^{\infty} a^i = \frac{1 - a^n}{1 - a}$.

The variance of $m_b$ is

$$\mathrm{var}(m_b) = \left(\frac{1 - a^n}{1 - a}\right)^{-2} \sum_{i=1}^{n} a^{2(i-1)} \sigma_Y^2 = \sigma_Y^2 \frac{(1 - a^{2n})(1 - a)^2}{(1 - a^2)(1 - a^n)^2} = \sigma_Y^2 \frac{(1 + a^n)(1 - a)}{(1 - a^n)(1 + a)},$$

which has the limit $\mathrm{var}(m_b) \to \sigma_Y^2 (1 - a)/(1 + a) > 0$ as $n \to \infty$. Because $Y_i$ is
normally distributed, $m_b$ is normally distributed with mean $\mu_Y$ and the variance given
above. Thus $m_b$ has a positive probability of falling outside any interval around $\mu_Y$,
so $\Pr(|m_b - \mu_Y| \ge \delta)$ does not tend to 0 and $m_b$ is inconsistent. This is perhaps
surprising because this estimator uses all the observations. Most of the observations,
however, receive very small weight (the weight of the $i$th observation is
proportional to $a^{i-1}$, a very small number when $i$ is large), and for this reason, there is
an insufficient amount of cancellation of sampling errors for the estimator to be
consistent.

The third estimator, $m_c$, is biased but consistent. Its bias is $1/n$: $E(m_c) =
E(\bar{Y} + 1/n) = \mu_Y + 1/n$, so the bias tends to 0 as the sample size increases. To see
why $m_c$ is consistent, note that $\Pr(|m_c - \mu_Y| \ge \delta) = \Pr(|\bar{Y} + 1/n - \mu_Y| \ge \delta)$. Now, from
Equation (18.43) in Appendix 18.2, a generalization of Chebychev’s inequality
implies that for any random variable $W$, $\Pr(|W| \ge \delta) \le E(W^2)/\delta^2$ for any positive
constant $\delta$. Thus $\Pr(|\bar{Y} + 1/n - \mu_Y| \ge \delta) \le E[(\bar{Y} + 1/n - \mu_Y)^2]/\delta^2$. But
$E[(\bar{Y} + 1/n - \mu_Y)^2] = \mathrm{var}(\bar{Y}) + 1/n^2 = \sigma_Y^2/n + 1/n^2 \to 0$ as $n$ grows large.
It follows that $\Pr(|\bar{Y} + 1/n - \mu_Y| \ge \delta) \to 0$ and $m_c$ is consistent. This example
illustrates the general point that an estimator can be biased in finite samples but if 
that bias vanishes as the sample size gets large, the estimator can still be consistent 
(Exercise 18.10). 
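
A small simulation can illustrate the three cases. This is a sketch under illustrative assumptions (the values $a = 0.5$, $\sigma_Y = 1$, $\delta = 0.25$, the sample sizes, and the function name `simulate` are choices made here, not taken from the text). It estimates $\Pr(|m - \mu_Y| \ge \delta)$ for each estimator at several sample sizes.

```python
import random

def simulate(n, a=0.5, sigma=1.0, delta=0.25, reps=2000, seed=1):
    """Estimate Pr(|m - mu_Y| >= delta) for m_a, m_b, and m_c with N(0, sigma^2) data."""
    rng = random.Random(seed)
    mu_y = 0.0
    weight_sum = (1 - a ** n) / (1 - a)  # sum of a^(i-1) for i = 1, ..., n
    miss = {"m_a": 0, "m_b": 0, "m_c": 0}
    for _ in range(reps):
        y = [rng.gauss(mu_y, sigma) for _ in range(n)]
        m_a = y[0]                                                  # first observation only
        m_b = sum(a ** i * y_i for i, y_i in enumerate(y)) / weight_sum  # geometric weights
        m_c = sum(y) / n + 1.0 / n                                  # sample mean plus 1/n
        for name, estimate in (("m_a", m_a), ("m_b", m_b), ("m_c", m_c)):
            if abs(estimate - mu_y) >= delta:
                miss[name] += 1
    return {name: count / reps for name, count in miss.items()}

tail_by_n = {n: simulate(n) for n in (10, 100, 400)}
for n, tails in tail_by_n.items():
    print(n, {name: round(p, 3) for name, p in tails.items()})
```

In runs of this kind the tail probabilities for $m_a$ and $m_b$ stay well away from 0 as $n$ grows, while the tail probability for $m_c$ shrinks toward 0, in line with the discussion above.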


The Central Limit Theorem and Convergence 
in Distribution 


If the distributions of a sequence of random variables converge to a limit as $n \to \infty$,
then the sequence of random variables is said to converge in distribution. The central 
limit theorem says that, under general conditions, the standardized sample average 
converges in distribution to a normal random variable. 


Convergence in distribution. Let $F_1, F_2, \ldots, F_n, \ldots$ be a sequence of cumulative
distribution functions corresponding to a sequence of random variables, $S_1, S_2, \ldots, S_n, \ldots$.
For example, $S_n$ might be the standardized sample average, $(\bar{Y} - \mu_Y)/\sigma_{\bar{Y}}$.
Then the sequence of random variables $S_n$ is said to converge in distribution
to $S$ (denoted $S_n \xrightarrow{d} S$) if the distribution functions $\{F_n\}$ converge to $F$, the distribution
of $S$. That is,

$$S_n \xrightarrow{d} S \text{ if and only if } \lim_{n \to \infty} F_n(t) = F(t), \tag{18.7}$$

where the limit holds at all points $t$ at which the limiting distribution $F$ is continuous.
The distribution $F$ is called the asymptotic distribution of $S_n$.

It is useful to contrast the concepts of convergence in probability ($\xrightarrow{p}$) and
convergence in distribution ($\xrightarrow{d}$). If $S_n \xrightarrow{p} \mu$, then $S_n$ becomes close to $\mu$ with
high probability as $n$ increases. In contrast, if $S_n \xrightarrow{d} S$, then the distribution of $S_n$
becomes close to the distribution of $S$ as $n$ increases.


The central limit theorem. We now restate the central limit theorem using the
concept of convergence in distribution. The central limit theorem in Key Concept 2.7
states that if $Y_1, \ldots, Y_n$ are i.i.d. and $0 < \sigma_Y^2 < \infty$, then the asymptotic
distribution of $(\bar{Y} - \mu_Y)/\sigma_{\bar{Y}}$ is $N(0, 1)$. Because $\sigma_{\bar{Y}} = \sigma_Y/\sqrt{n}$, $(\bar{Y} - \mu_Y)/\sigma_{\bar{Y}} =
\sqrt{n}(\bar{Y} - \mu_Y)/\sigma_Y$. Thus the central limit theorem can be restated as
$\sqrt{n}(\bar{Y} - \mu_Y) \xrightarrow{d} \sigma_Y Z$, where $Z$ is a standard normal random variable. This means
that the distribution of $\sqrt{n}(\bar{Y} - \mu_Y)$ converges to $N(0, \sigma_Y^2)$ as $n \to \infty$. Conventional
shorthand for this limit is

$$\sqrt{n}(\bar{Y} - \mu_Y) \xrightarrow{d} N(0, \sigma_Y^2). \tag{18.8}$$

That is, if $Y_1, \ldots, Y_n$ are i.i.d. and $0 < \sigma_Y^2 < \infty$, then the distribution of $\sqrt{n}(\bar{Y} - \mu_Y)$
converges to a normal distribution with mean 0 and variance $\sigma_Y^2$.
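
Equation (18.8) can be illustrated by simulation. The sketch below assumes i.i.d. draws from an exponential distribution with mean 1 (so $\mu_Y = 1$ and $\sigma_Y^2 = 1$), a deliberately skewed choice made here for illustration; the function names and the sample size are likewise assumptions. It compares the empirical distribution function of $\sqrt{n}(\bar{Y} - \mu_Y)$ with the $N(0, \sigma_Y^2)$ distribution function at a few points.

```python
import math
import random
import statistics

def scaled_means(n, reps=4000, seed=2):
    """Draws of sqrt(n) * (Ybar - mu_Y) for i.i.d. exponential(1) data (mu_Y = 1)."""
    rng = random.Random(seed)
    mu_y = 1.0
    return [math.sqrt(n) * (sum(rng.expovariate(1.0) for _ in range(n)) / n - mu_y)
            for _ in range(reps)]

def normal_cdf(t, sigma=1.0):
    """Cumulative distribution function of N(0, sigma^2) evaluated at t."""
    return 0.5 * (1.0 + math.erf(t / (sigma * math.sqrt(2.0))))

def empirical_cdf(draws, t):
    """Fraction of simulated draws that are less than or equal to t."""
    return sum(d <= t for d in draws) / len(draws)

draws = scaled_means(n=200)
print("simulated variance:", round(statistics.pvariance(draws), 3))
for t in (-1.0, 0.0, 1.0):
    print(f"t={t:+.1f}  F_n(t)={empirical_cdf(draws, t):.3f}  F(t)={normal_cdf(t):.3f}")
```

Even though the underlying draws are far from normal, the simulated distribution function is close to the normal one at each point, which is what convergence in distribution in Equation (18.7) describes.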


Extensions to time series data. The law of large numbers and central limit theorem
stated in Section 2.6 apply to i.i.d. observations. As discussed in Chapter 14, the i.i.d.
assumption is inappropriate for time series data, and these theorems need to be
extended before they can be applied to time series observations. Those extensions
are technical in nature in the sense that the conclusion is the same (versions of the
law of large numbers and the central limit theorem apply to time series data), but
the conditions under which they apply are different. This is discussed briefly in
Section 16.4, but a mathematical treatment of asymptotic distribution theory for time 
series variables is beyond the scope of this text, and interested readers are referred 
to Hayashi (2000, Chapter 2). 


Slutsky’s Theorem and the Continuous 
Mapping Theorem 


Slutsky’s theorem combines consistency and convergence in distribution. Suppose
that $a_n \xrightarrow{p} a$, where $a$ is a constant, and $S_n \xrightarrow{d} S$. Then

$$a_n + S_n \xrightarrow{d} a + S, \quad a_n S_n \xrightarrow{d} aS, \quad \text{and, if } a \ne 0, \quad S_n/a_n \xrightarrow{d} S/a. \tag{18.9}$$

These three results are together called Slutsky’s theorem.

The continuous mapping theorem concerns the asymptotic properties of a
continuous function, $g$, of a sequence of random variables, $S_n$. The theorem has
two parts. The first is that if $S_n$ converges in probability to the constant $a$, then
$g(S_n)$ converges in probability to $g(a)$; the second is that if $S_n$ converges in
distribution to $S$, then $g(S_n)$ converges in distribution to $g(S)$. That is, if $g$ is a
continuous function, then

$$\text{(i) if } S_n \xrightarrow{p} a, \text{ then } g(S_n) \xrightarrow{p} g(a), \text{ and (ii) if } S_n \xrightarrow{d} S, \text{ then } g(S_n) \xrightarrow{d} g(S). \tag{18.10}$$

As an example of (i), if $s_Y^2 \xrightarrow{p} \sigma_Y^2$, then $\sqrt{s_Y^2} = s_Y \xrightarrow{p} \sigma_Y$. As an example of (ii),
suppose that $S_n \xrightarrow{d} Z$, where $Z$ is a standard normal random variable, and let
$g(S_n) = S_n^2$. Because $g$ is continuous, the continuous mapping theorem applies and
$g(S_n) \xrightarrow{d} g(Z)$; that is, $S_n^2 \xrightarrow{d} Z^2$. In other words, the distribution of $S_n^2$ converges
to the distribution of a squared standard normal random variable, which in turn has
a $\chi_1^2$ distribution; that is, $S_n^2 \xrightarrow{d} \chi_1^2$.
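
The chi-squared example can be checked numerically. The following sketch assumes uniform (0, 1) data and takes $S_n$ to be the standardized sample average; the function names and the sample size are illustrative choices. It uses the fact that $\Pr(Z^2 \le t) = \Pr(|Z| \le \sqrt{t}) = \mathrm{erf}(\sqrt{t/2})$ for a standard normal $Z$.

```python
import math
import random

def squared_standardized_means(n, reps=4000, seed=6):
    """Draws of S_n^2, where S_n = sqrt(n) * (Ybar - mu_Y) / sigma_Y for U(0, 1) data."""
    rng = random.Random(seed)
    mu_y, sigma_y = 0.5, math.sqrt(1.0 / 12.0)
    draws = []
    for _ in range(reps):
        ybar = sum(rng.random() for _ in range(n)) / n
        s_n = math.sqrt(n) * (ybar - mu_y) / sigma_y
        draws.append(s_n ** 2)
    return draws

def chi2_1_cdf(t):
    """Distribution function of a chi-squared variable with 1 degree of freedom."""
    return math.erf(math.sqrt(t / 2.0))

sq_draws = squared_standardized_means(n=100)
comparison = {t: (sum(d <= t for d in sq_draws) / len(sq_draws), chi2_1_cdf(t))
              for t in (0.5, 1.0, 3.84)}
for t, (simulated, limit) in comparison.items():
    print(f"t={t:4.2f}  simulated={simulated:.3f}  chi-squared(1)={limit:.3f}")
```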


Application to the t-Statistic Based 
on the Sample Mean 
We now use the central limit theorem, the law of large numbers, and Slutsky’s
theorem to prove that, under the null hypothesis, the t-statistic based on $\bar{Y}$ has a standard
normal distribution when $Y_1, \ldots, Y_n$ are i.i.d. and $0 < E(Y_i^4) < \infty$.

The t-statistic for testing the null hypothesis that $E(Y_i) = \mu_0$ based on the sample
average $\bar{Y}$ is given in Equations (3.8) and (3.11), and can be written

$$t = \frac{\bar{Y} - \mu_0}{s_Y/\sqrt{n}} = \frac{\sqrt{n}(\bar{Y} - \mu_0)}{\sigma_Y} \div \frac{s_Y}{\sigma_Y}, \tag{18.11}$$

where the second equality uses the trick of dividing both the numerator and the
denominator by $\sigma_Y$.

Because $Y_1, \ldots, Y_n$ have two moments (which is implied by their having four
moments; see Exercise 18.5) and because $Y_1, \ldots, Y_n$ are i.i.d., the first term after the
final equality in Equation (18.11) obeys the central limit theorem: Under the null
hypothesis, $\sqrt{n}(\bar{Y} - \mu_0)/\sigma_Y \xrightarrow{d} N(0, 1)$. In addition, $s_Y^2 \xrightarrow{p} \sigma_Y^2$ (as proven in
Appendix 3.3), so $s_Y^2/\sigma_Y^2 \xrightarrow{p} 1$ and the ratio in the second term in Equation (18.11)
tends to 1 (Exercise 18.4). Thus the expression after the final equality in Equation
(18.11) has the form of the final expression in Equation (18.9), where [in the notation
of Equation (18.9)] $S_n = \sqrt{n}(\bar{Y} - \mu_0)/\sigma_Y \xrightarrow{d} N(0, 1)$ and $a_n = s_Y/\sigma_Y \xrightarrow{p} 1$. It
follows by applying Slutsky’s theorem that $t \xrightarrow{d} N(0, 1)$.

18.3 Asymptotic Distribution of the OLS
Estimator and t-Statistic 


Recall from Chapter 4 that, under the assumptions of Key Concept 4.3 (the first three
assumptions of Key Concept 18.1), the OLS estimator $\hat{\beta}_1$ is consistent, and
$\sqrt{n}(\hat{\beta}_1 - \beta_1)$ has an asymptotic normal distribution. Moreover, the t-statistic testing
the null hypothesis $\beta_1 = \beta_{1,0}$ has an asymptotic standard normal distribution under
the null hypothesis. This section summarizes these results and provides additional 
details of their proofs. 


Consistency and Asymptotic Normality 
of the OLS Estimators 


The large-sample distribution of $\hat{\beta}_1$, originally stated in Key Concept 4.4, is

$$\sqrt{n}(\hat{\beta}_1 - \beta_1) \xrightarrow{d} N\left(0, \frac{\mathrm{var}(v_i)}{[\mathrm{var}(X_i)]^2}\right), \tag{18.12}$$

where $v_i = (X_i - \mu_X)u_i$. The proof of this result was sketched in Appendix 4.3, but
that proof omitted some details and involved an approximation that was not formally
shown. The missing steps in that proof are left as Exercise 18.3.

An implication of Equation (18.12) is that $\hat{\beta}_1$ is consistent (Exercise 18.4).


Consistency of Heteroskedasticity-Robust 
Standard Errors 


Under the first three least squares assumptions, the heteroskedasticity-robust
standard error for $\hat{\beta}_1$ forms the basis for valid statistical inferences. Specifically,

$$\frac{\hat{\sigma}_{\hat{\beta}_1}^2}{\sigma_{\hat{\beta}_1}^2} \xrightarrow{p} 1, \tag{18.13}$$

where $\sigma_{\hat{\beta}_1}^2 = \mathrm{var}(v_i)/\{n[\mathrm{var}(X_i)]^2\}$ and $\hat{\sigma}_{\hat{\beta}_1}^2$ is the square of the
heteroskedasticity-robust standard error defined in Equation (5.4); that is,

$$\hat{\sigma}_{\hat{\beta}_1}^2 = \frac{1}{n} \times \frac{\frac{1}{n-2} \sum_{i=1}^{n} (X_i - \bar{X})^2 \hat{u}_i^2}{\left[\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2\right]^2}. \tag{18.14}$$

To show the result in Equation (18.13), first use the definitions of $\sigma_{\hat{\beta}_1}^2$ and $\hat{\sigma}_{\hat{\beta}_1}^2$ to
rewrite the ratio in Equation (18.13) as

$$\frac{\hat{\sigma}_{\hat{\beta}_1}^2}{\sigma_{\hat{\beta}_1}^2} = \left[\frac{n}{n-2}\right] \left[\frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \hat{u}_i^2}{\mathrm{var}(v_i)}\right] \div \left[\frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}{\mathrm{var}(X_i)}\right]^2. \tag{18.15}$$

We need to show that each of the three terms in brackets on the right-hand side of
Equation (18.15) converges in probability to 1. Clearly, the first term converges to 1,
and by the consistency of the sample variance (Appendix 3.3), the final term
converges in probability to 1. Thus all that remains is to show that the second term
converges in probability to 1, that is, that $\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \hat{u}_i^2 \xrightarrow{p} \mathrm{var}(v_i)$.

The proof that $\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \hat{u}_i^2 \xrightarrow{p} \mathrm{var}(v_i)$ proceeds in two steps. The first
shows that $\frac{1}{n} \sum_{i=1}^{n} v_i^2 \xrightarrow{p} \mathrm{var}(v_i)$; the second shows that
$\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \hat{u}_i^2 - \frac{1}{n} \sum_{i=1}^{n} v_i^2 \xrightarrow{p} 0$.

For the moment, suppose that $X_i$ and $u_i$ have eight moments [that is, $E(X_i^8) < \infty$
and $E(u_i^8) < \infty$], which is a stronger assumption than the four moments required by
the third least squares assumption. To show the first step, we must show that $\frac{1}{n} \sum_{i=1}^{n} v_i^2$
obeys the law of large numbers in Equation (18.5). To do so, $v_i^2$ must be i.i.d. (which
it is by the second least squares assumption), and $\mathrm{var}(v_i^2)$ must be finite. To show
that $\mathrm{var}(v_i^2) < \infty$, apply the Cauchy-Schwarz inequality (Appendix 18.2):
$\mathrm{var}(v_i^2) \le E(v_i^4) = E[(X_i - \mu_X)^4 u_i^4] \le \{E[(X_i - \mu_X)^8] E(u_i^8)\}^{1/2}$. Thus, if $X_i$ and $u_i$
have eight moments, then $v_i^2$ has a finite variance and thus satisfies the law of large
numbers in Equation (18.5).

The second step is to prove that $\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \hat{u}_i^2 - \frac{1}{n} \sum_{i=1}^{n} v_i^2 \xrightarrow{p} 0$. Because
$v_i = (X_i - \mu_X)u_i$, this second step is the same as showing that

$$\frac{1}{n} \sum_{i=1}^{n} \left[(X_i - \bar{X})^2 \hat{u}_i^2 - (X_i - \mu_X)^2 u_i^2\right] \xrightarrow{p} 0. \tag{18.16}$$

Showing this result entails setting $\hat{u}_i = u_i - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1)X_i$, expanding the
term in Equation (18.16) in brackets, repeatedly applying the Cauchy-Schwarz inequality,
and using the consistency of $\hat{\beta}_0$ and $\hat{\beta}_1$. The details of the algebra are left as Exercise 18.9.

The preceding argument supposes that $X_i$ and $u_i$ have eight moments. This is not
necessary, however, and the result $\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \hat{u}_i^2 \xrightarrow{p} \mathrm{var}(v_i)$ can be proven
under the weaker assumption that $X_i$ and $u_i$ have four moments, as stated in the third
least squares assumption. That proof, however, is beyond the scope of this text; see
Hayashi (2000, Section 2.5) for details. 


Asymptotic Normality of the Heteroskedasticity- 
Robust t-Statistic 
We now show that, under the null hypothesis, the heteroskedasticity-robust OLS
t-statistic testing the hypothesis $\beta_1 = \beta_{1,0}$ has an asymptotic standard normal
distribution if least squares assumptions 1, 2, and 3 hold.

The t-statistic constructed using the heteroskedasticity-robust standard error
$SE(\hat{\beta}_1) = \hat{\sigma}_{\hat{\beta}_1}$ [defined in Equation (18.14)] is

$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{\hat{\sigma}_{\hat{\beta}_1}} = \frac{\sqrt{n}(\hat{\beta}_1 - \beta_{1,0})}{\sqrt{n \sigma_{\hat{\beta}_1}^2}} \div \sqrt{\frac{\hat{\sigma}_{\hat{\beta}_1}^2}{\sigma_{\hat{\beta}_1}^2}}. \tag{18.17}$$

It follows from Equation (18.12) and the definition of $\sigma_{\hat{\beta}_1}^2$ that the first term after the
second equality in Equation (18.17) converges in distribution to a standard normal
random variable. In addition, because the heteroskedasticity-robust standard error is
consistent in the sense of Equation (18.13), $\sqrt{\hat{\sigma}_{\hat{\beta}_1}^2/\sigma_{\hat{\beta}_1}^2} \xrightarrow{p} 1$ (Exercise 18.4). It
follows from Slutsky’s theorem that $t \xrightarrow{d} N(0, 1)$.
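
As a rough numerical check, the sketch below computes the heteroskedasticity-robust standard error from Equation (18.14) and simulates the rejection rate of the two-sided test at the 5% level (critical value 1.96) when the null hypothesis is true. The data-generating process (uniform regressor, error standard deviation $0.5 + X_i$), the sample size, and the function names are illustrative assumptions, not taken from the text.

```python
import math
import random

def ols_with_robust_se(x, y):
    """OLS slope and its heteroskedasticity-robust standard error, Equation (18.14)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    beta0 = y_bar - beta1 * x_bar
    resid = [yi - beta0 - beta1 * xi for xi, yi in zip(x, y)]
    # Numerator: (1/(n-2)) * sum (X_i - Xbar)^2 * uhat_i^2
    numerator = sum((xi - x_bar) ** 2 * ui ** 2 for xi, ui in zip(x, resid)) / (n - 2)
    # Denominator: [(1/n) * sum (X_i - Xbar)^2]^2
    denominator = (sxx / n) ** 2
    var_hat = numerator / denominator / n
    return beta1, math.sqrt(var_hat)

def rejection_rate(n=200, reps=1000, beta1=2.0, seed=3):
    """Share of samples in which |t| > 1.96 when the null beta_1 = beta1 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.uniform(0.0, 2.0) for _ in range(n)]
        # Heteroskedastic errors: the standard deviation of u_i rises with X_i.
        y = [1.0 + beta1 * xi + rng.gauss(0.0, 0.5 + xi) for xi in x]
        b1, se = ols_with_robust_se(x, y)
        if abs((b1 - beta1) / se) > 1.96:
            rejections += 1
    return rejections / reps

rate = rejection_rate()
print("simulated rejection rate under the null:", rate)
```

If the asymptotic approximation is working, the simulated rejection rate should be close to 0.05 despite the heteroskedasticity.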


Exact Sampling Distributions When 
the Errors Are Normally Distributed 


In small samples, the distribution of the OLS estimator and t-statistic depends on the 
distribution of the regressors and regression error and typically is complicated. As 
discussed in Section 5.6, however, if the regression errors are homoskedastic and 
normally distributed, then these distributions are simple. Specifically, if all five 
extended least squares assumptions in Key Concept 18.1 hold, then the OLS estimator
has a normal sampling distribution, conditional on $X_1, \ldots, X_n$. Moreover, the
t-statistic has a Student t distribution. We present these results here for $\hat{\beta}_1$.


Distribution of $\hat{\beta}_1$ with Normal Errors


If the errors are i.i.d. normally distributed and independent of the regressors, then
the distribution of $\hat{\beta}_1$ conditional on $X_1, \ldots, X_n$ is $N(\beta_1, \sigma_{\hat{\beta}_1 | X}^2)$, where

$$\sigma_{\hat{\beta}_1 | X}^2 = \frac{\sigma_u^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}. \tag{18.18}$$

The derivation of the normal distribution $N(\beta_1, \sigma_{\hat{\beta}_1 | X}^2)$, conditional on $X_1, \ldots, X_n$,
entails (i) establishing that the distribution is normal; (ii) showing that
$E(\hat{\beta}_1 | X_1, \ldots, X_n) = \beta_1$; and (iii) verifying Equation (18.18).

To show (i), note that, conditional on $X_1, \ldots, X_n$, $\hat{\beta}_1 - \beta_1$ is a weighted average
of $u_1, \ldots, u_n$:

$$\hat{\beta}_1 = \beta_1 + \frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}) u_i}{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}. \tag{18.19}$$

This equation was derived in Appendix 4.3 [Equation (4.28)] and is restated here
for convenience. By extended least squares assumptions 1, 2, 4, and 5, $u_i$ is i.i.d.
$N(0, \sigma_u^2)$, and $u_i$ and $X_i$ are independently distributed. Because weighted averages of
normally distributed variables are themselves normally distributed, it follows that $\hat{\beta}_1$
is normally distributed, conditional on $X_1, \ldots, X_n$.

To show (ii), take conditional expectations of both sides of Equation (18.19):

$$E[(\hat{\beta}_1 - \beta_1) | X_1, \ldots, X_n] = E\left[\frac{\sum_{i=1}^{n} (X_i - \bar{X}) u_i}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \,\Big|\, X_1, \ldots, X_n\right] = \frac{\sum_{i=1}^{n} (X_i - \bar{X}) E(u_i | X_1, \ldots, X_n)}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = 0,$$

where the final equality follows because $E(u_i | X_1, \ldots, X_n) = E(u_i | X_i) = 0$ and because
$\sum_{i=1}^{n} (X_i - \bar{X})^2 \ne 0$ by assumption. Thus $\hat{\beta}_1$ is conditionally unbiased; that is,

$$E(\hat{\beta}_1 | X_1, \ldots, X_n) = \beta_1. \tag{18.20}$$

To show (iii), use the fact that the errors are independently distributed, conditional
on $X_1, \ldots, X_n$, to calculate the conditional variance of $\hat{\beta}_1$ using Equation (18.19):

$$\mathrm{var}(\hat{\beta}_1 | X_1, \ldots, X_n) = \mathrm{var}\left[\frac{\sum_{i=1}^{n} (X_i - \bar{X}) u_i}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \,\Big|\, X_1, \ldots, X_n\right] = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2 \mathrm{var}(u_i | X_1, \ldots, X_n)}{\left[\sum_{i=1}^{n} (X_i - \bar{X})^2\right]^2} = \frac{\sigma_u^2 \sum_{i=1}^{n} (X_i - \bar{X})^2}{\left[\sum_{i=1}^{n} (X_i - \bar{X})^2\right]^2}. \tag{18.21}$$

Canceling the term in the numerator in the final expression in Equation (18.21) 
yields the formula for the conditional variance in Equation (18.18). 
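
Equations (18.18) and (18.20) can be checked by simulation with the regressors held fixed across replications. In the sketch below the regressor values, the parameter values, and the function name `ols_slope` are illustrative assumptions; only the errors are redrawn, so the spread of the simulated slopes is conditional on $X_1, \ldots, X_n$.

```python
import random
import statistics

def ols_slope(x, y):
    """OLS slope from the regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx

# Regressor values are held fixed, so all results are conditional on X_1, ..., X_n.
x = [0.5, 1.0, 1.5, 2.0, 3.0, 4.5, 5.0, 6.0]
beta0, beta1, sigma_u = 1.0, 2.0, 1.5

x_bar = sum(x) / len(x)
theoretical_var = sigma_u ** 2 / sum((xi - x_bar) ** 2 for xi in x)  # Equation (18.18)

rng = random.Random(4)
slopes = []
for _ in range(20000):
    # i.i.d. normal, homoskedastic errors, drawn independently of the fixed regressors
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma_u) for xi in x]
    slopes.append(ols_slope(x, y))

simulated_mean = statistics.fmean(slopes)
simulated_var = statistics.pvariance(slopes)
print(f"mean of slopes: {simulated_mean:.4f} (beta_1 = {beta1})")
print(f"variance of slopes: {simulated_var:.5f} (formula: {theoretical_var:.5f})")
```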


Distribution of the Homoskedasticity-Only t-Statistic 
The homoskedasticity-only t-statistic testing the null hypothesis $\beta_1 = \beta_{1,0}$ is

$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}, \tag{18.22}$$

where $SE(\hat{\beta}_1)$ is computed using the homoskedasticity-only standard error of $\hat{\beta}_1$.
Substituting the formula for $SE(\hat{\beta}_1)$ [Equation (5.29) of Appendix 5.1] into Equation
(18.22) and rearranging yields

$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{s_{\hat{u}}^2 / \sum_{i=1}^{n} (X_i - \bar{X})^2}} = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\sigma_u^2 / \sum_{i=1}^{n} (X_i - \bar{X})^2}} \div \sqrt{\frac{s_{\hat{u}}^2}{\sigma_u^2}} = \frac{(\hat{\beta}_1 - \beta_{1,0})/\sigma_{\hat{\beta}_1 | X}}{\sqrt{W/(n-2)}}, \tag{18.23}$$

where $s_{\hat{u}}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2$ and $W = \sum_{i=1}^{n} \hat{u}_i^2/\sigma_u^2$. Under the null hypothesis, $\hat{\beta}_1$ has an
$N(\beta_{1,0}, \sigma_{\hat{\beta}_1 | X}^2)$ distribution, conditional on $X_1, \ldots, X_n$, so the distribution of the
numerator in the final expression in Equation (18.23) is $N(0, 1)$. It is shown in Section 19.4
that $W$ has a chi-squared distribution with $n - 2$ degrees of freedom and moreover that
$W$ is distributed independently of the standardized OLS estimator in the numerator of
Equation (18.23). It follows from the definition of the Student t distribution
(Appendix 18.1) that, under the five extended least squares assumptions, the
homoskedasticity-only t-statistic has a Student t distribution with $n - 2$ degrees of
freedom. 


Where does the degrees of freedom adjustment fit in? The degrees of freedom 
adjustment in $s_{\hat{u}}^2$ ensures that $s_{\hat{u}}^2$ is an unbiased estimator of $\sigma_u^2$ and that the t-statistic
has a Student t distribution when the errors are normally distributed.

Because $W = \sum_{i=1}^{n} \hat{u}_i^2/\sigma_u^2$ is a chi-squared random variable with $n - 2$ degrees
of freedom, its mean is $E(W) = n - 2$. Thus $E[W/(n - 2)] = (n - 2)/(n - 2) = 1$.
Rearranging the definition of $W$, we have that $E\left(\frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2\right) = \sigma_u^2$. Thus the
degrees of freedom correction makes $s_{\hat{u}}^2$ an unbiased estimator of $\sigma_u^2$. Also, by
dividing by $n - 2$ rather than $n$, the term in the denominator of the final expression of
Equation (18.23) matches the definition of a random variable with a Student t
distribution given in Appendix 18.1. That is, by using the degrees of freedom adjustment
to calculate the standard error, the t-statistic has the Student t distribution when the
errors are normally distributed. 
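
The unbiasedness claim can be illustrated with a short simulation. This is a sketch under assumed values (ten fixed regressor values, $\sigma_u = 2$, and the function name `ssr_of_ols_fit` are all illustrative): it averages $W$, the adjusted estimator that divides by $n - 2$, and the unadjusted estimator that divides by $n$ over many samples with normal homoskedastic errors.

```python
import random
import statistics

def ssr_of_ols_fit(x, y):
    """Sum of squared OLS residuals from regressing y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

x = [float(i) for i in range(1, 11)]  # n = 10 fixed regressor values
n, sigma_u = len(x), 2.0              # so sigma_u^2 = 4
rng = random.Random(5)
ssr_draws = []
for _ in range(20000):
    y = [1.0 + 0.5 * xi + rng.gauss(0.0, sigma_u) for xi in x]
    ssr_draws.append(ssr_of_ols_fit(x, y))

mean_w = statistics.fmean(s / sigma_u ** 2 for s in ssr_draws)      # should be near n - 2
mean_s2_adjusted = statistics.fmean(s / (n - 2) for s in ssr_draws)  # divides by n - 2
mean_s2_unadjusted = statistics.fmean(s / n for s in ssr_draws)      # divides by n
print(f"average W: {mean_w:.3f} (n - 2 = {n - 2})")
print(f"average of SSR/(n-2): {mean_s2_adjusted:.3f}; average of SSR/n: {mean_s2_unadjusted:.3f}")
```

With these settings the adjusted estimator should average close to $\sigma_u^2 = 4$, while dividing by $n$ gives an average near $\sigma_u^2 (n - 2)/n = 3.2$, which is biased downward.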


Weighted Least Squares 


Under the first four extended least squares assumptions, the OLS estimator is efficient
among the class of linear (in $Y_1, \ldots, Y_n$), conditionally (on $X_1, \ldots, X_n$) unbiased
estimators; that is, the OLS estimator is the best linear unbiased estimator (BLUE). This
result is the Gauss-Markov theorem, which was discussed in Section 5.5 and proven in
Appendix 5.2. The Gauss-Markov theorem provides a theoretical justification for
using the OLS estimator. A major limitation of the Gauss-Markov theorem is that it
requires homoskedastic errors. If, as is often encountered in practice, the errors are
heteroskedastic, the Gauss-Markov theorem does not hold, and the OLS estimator
is not BLUE.

This section presents a modification of the OLS estimator, called weighted least 
squares (WLS), which is more efficient than OLS when the errors are 
heteroskedastic. 

WLS requires knowing quite a bit about the conditional variance function,
$\mathrm{var}(u_i | X_i)$. We consider two cases. In the first case, $\mathrm{var}(u_i | X_i)$ is known up to a factor
of proportionality, and WLS is BLUE. In the second case, the functional form of
$\mathrm{var}(u_i | X_i)$ is known, but this functional form has some unknown parameters that can
be estimated. Under some additional conditions, the asymptotic distribution of WLS
in the second case is the same as if the parameters of the conditional variance function
were, in fact, known, and in this sense, the WLS estimator is asymptotically BLUE. The
section concludes with a discussion of the practical advantages and disadvantages of
handling heteroskedasticity using WLS or, alternatively, heteroskedasticity-robust
standard errors.


WLS with Known Heteroskedasticity 


Suppose that the conditional variance $\mathrm{var}(u_i | X_i)$ is known up to a factor of
proportionality; that is,

$$\mathrm{var}(u_i | X_i) = \lambda h(X_i), \tag{18.24}$$

where $\lambda$ is a constant and $h$ is a known function. In this case, the WLS estimator is the
estimator obtained by first dividing the dependent variable and regressor by the square
root of $h$ and then regressing this modified dependent variable on the modified
regressor using OLS. Specifically, divide both sides of the single-variable regressor model by
$\sqrt{h(X_i)}$ to obtain

$$\tilde{Y}_i = \beta_0 \tilde{X}_{0i} + \beta_1 \tilde{X}_{1i} + \tilde{u}_i, \tag{18.25}$$

where $\tilde{Y}_i = Y_i/\sqrt{h(X_i)}$, $\tilde{X}_{0i} = 1/\sqrt{h(X_i)}$, $\tilde{X}_{1i} = X_i/\sqrt{h(X_i)}$, and $\tilde{u}_i = u_i/\sqrt{h(X_i)}$.

The WLS estimator is the OLS estimator of $\beta_1$ in Equation (18.25); that is, it is
the estimator obtained by the OLS regression of $\tilde{Y}_i$ on $\tilde{X}_{0i}$ and $\tilde{X}_{1i}$, where the
coefficient on $\tilde{X}_{0i}$ takes the place of the intercept in the unweighted regression.

Under the first three least squares assumptions in Key Concept 18.1 plus the
known heteroskedasticity assumption in Equation (18.24), WLS is BLUE. The reason
that the WLS estimator is BLUE is that weighting the variables has made the error
term $\tilde{u}_i$ in the weighted regression homoskedastic. That is,

$$\mathrm{var}(\tilde{u}_i | X_i) = \mathrm{var}\left[\frac{u_i}{\sqrt{h(X_i)}} \,\Big|\, X_i\right] = \frac{\mathrm{var}(u_i | X_i)}{h(X_i)} = \frac{\lambda h(X_i)}{h(X_i)} = \lambda, \tag{18.26}$$

so the conditional variance of $\tilde{u}_i$, $\mathrm{var}(\tilde{u}_i | X_i)$, is constant. Thus the first four least
squares assumptions apply to Equation (18.25). Strictly speaking, the Gauss-Markov
theorem was proven in Appendix 5.2 for Equation (18.1), which includes the
intercept $\beta_0$, so it does not apply to Equation (18.25), in which the intercept is replaced by
$\beta_0 \tilde{X}_{0i}$. However, the extension of the Gauss-Markov theorem for multiple regression
(Section 19.5) does apply to estimation of $\beta_1$ in the weighted population regression,
Equation (18.25). Accordingly, the OLS estimator of $\beta_1$ in Equation (18.25), that is,
the WLS estimator of $\beta_1$, is BLUE.

In practice, the function $h$ typically is unknown, so neither the weighted variables
in Equation (18.25) nor the WLS estimator can be computed. For this reason, the
WLS estimator described here is sometimes called the infeasible WLS estimator.
To implement WLS in practice, the function $h$ must be estimated, the topic to which
we now turn. 
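
Before turning to estimation of $h$, the known-$h$ case can be sketched in code. The example below assumes, purely for illustration, that $h(X_i) = X_i^2$ and $\lambda = 0.25$; the function names and the simulated data are hypothetical. It builds the weighted variables of Equation (18.25), estimates the weighted regression by OLS without a separate intercept, and compares the sampling variance of the WLS and OLS slope estimators across repeated samples.

```python
import math
import random
import statistics

def ols(x, y):
    """OLS intercept and slope for the regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def ols_two_regressors_no_intercept(y, x0, x1):
    """Solve the 2x2 normal equations for y = b0*x0 + b1*x1 + error."""
    s00 = sum(a * a for a in x0)
    s01 = sum(a * b for a, b in zip(x0, x1))
    s11 = sum(b * b for b in x1)
    s0y = sum(a * c for a, c in zip(x0, y))
    s1y = sum(b * c for b, c in zip(x1, y))
    det = s00 * s11 - s01 ** 2
    return (s11 * s0y - s01 * s1y) / det, (s00 * s1y - s01 * s0y) / det

def wls_known_h(x, y, h):
    """WLS for known h: divide Y_i, 1, and X_i by sqrt(h(X_i)), then apply OLS."""
    root_h = [math.sqrt(h(xi)) for xi in x]
    y_t = [yi / r for yi, r in zip(y, root_h)]
    x0_t = [1.0 / r for r in root_h]
    x1_t = [xi / r for xi, r in zip(x, root_h)]
    return ols_two_regressors_no_intercept(y_t, x0_t, x1_t)

def h(x_val):
    """Assumed variance function: var(u | X) = 0.25 * h(X) with h(X) = X^2."""
    return x_val ** 2

rng = random.Random(7)
ols_slopes, wls_slopes = [], []
for _ in range(1000):
    x = [rng.uniform(1.0, 5.0) for _ in range(100)]
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5 * xi) for xi in x]  # sd of u_i is 0.5 * X_i
    ols_slopes.append(ols(x, y)[1])
    wls_slopes.append(wls_known_h(x, y, h)[1])

var_ols = statistics.pvariance(ols_slopes)
var_wls = statistics.pvariance(wls_slopes)
print(f"variance of OLS slope: {var_ols:.5f}; variance of WLS slope: {var_wls:.5f}")
```

Both estimators are centered on the true slope in this design, but the WLS slope should show the smaller sampling variance, consistent with WLS being BLUE when $h$ is known.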


WLS with Heteroskedasticity of Known Functional Form 


If the heteroskedasticity has a known functional form, then the heteroskedasticity
function $h$ can be estimated, and the WLS estimator can be calculated using this
estimated function.


Example 1: The variance of u is quadratic in X. Suppose that the conditional
variance is known to be the quadratic function

$$\mathrm{var}(u_i | X_i) = \theta_0 + \theta_1 X_i^2, \tag{18.27}$$

where $\theta_0$ and $\theta_1$ are unknown parameters, $\theta_0 > 0$, and $\theta_1 \ge 0$.

Because $\theta_0$ and $\theta_1$ are unknown, it is not possible to construct the weighted
variables $\tilde{Y}_i$, $\tilde{X}_{0i}$, and $\tilde{X}_{1i}$. It is, however, possible to estimate $\theta_0$ and $\theta_1$ and to use those
estimates to compute estimates of $\mathrm{var}(u_i | X_i)$. Let $\hat{\theta}_0$ and $\hat{\theta}_1$ be estimators of $\theta_0$
and $\theta_1$, and let $\widehat{\mathrm{var}}(u_i | X_i) = \hat{\theta}_0 + \hat{\theta}_1 X_i^2$. Define the weighted regressors
as $\tilde{Y}_i = Y_i/\sqrt{\widehat{\mathrm{var}}(u_i | X_i)}$, $\tilde{X}_{0i} = 1/\sqrt{\widehat{\mathrm{var}}(u_i | X_i)}$, and $\tilde{X}_{1i} = X_i/\sqrt{\widehat{\mathrm{var}}(u_i | X_i)}$. The
WLS estimator is the OLS estimator of the coefficients in the regression of $\tilde{Y}_i$ on $\tilde{X}_{0i}$
and $\tilde{X}_{1i}$ (where $\beta_0 \tilde{X}_{0i}$ takes the place of the intercept $\beta_0$).

Implementation of this estimator requires estimating the conditional variance
function, that is, estimating $\theta_0$ and $\theta_1$ in Equation (18.27). One way to estimate $\theta_0$
and $\theta_1$ consistently is to regress $\hat{u}_i^2$ on $X_i^2$ using OLS, where $\hat{u}_i^2$ is the square of the $i$th
OLS residual.

Suppose that the conditional variance has the form in Equation (18.27) and that
$\hat{\theta}_0$ and $\hat{\theta}_1$ are consistent estimators of $\theta_0$ and $\theta_1$. Under assumptions 1 through 3 of
Key Concept 18.1 plus additional moment conditions that arise because $\theta_0$ and $\theta_1$ are
estimated, the asymptotic distribution of the WLS estimator is the same as if $\theta_0$ and
$\theta_1$ were known. Thus the WLS estimator with $\theta_0$ and $\theta_1$ estimated has the same
asymptotic distribution as the infeasible WLS estimator and is in this sense asymptotically
BLUE. 

Because this method of WLS can be implemented by estimating unknown 
parameters of the conditional variance function, this method is sometimes called 
feasible WLS or estimated WLS. 


Example 2: The variance depends on a third variable. WLS also can be used when
the conditional variance depends on a third variable, $W_i$, which does not appear in
the regression function. Specifically, suppose that data are collected on three
variables, $Y_i$, $X_i$, and $W_i$, $i = 1, \ldots, n$; the population regression function depends on $X_i$
but not $W_i$; and the conditional variance depends on $W_i$ but not $X_i$. That is, the
population regression function is $E(Y_i | X_i, W_i) = \beta_0 + \beta_1 X_i$ and the conditional variance
is $\mathrm{var}(u_i | X_i, W_i) = \lambda h(W_i)$, where $\lambda$ is a constant and $h$ is a function that must be
estimated.

For example, suppose that a researcher is interested in modeling the relationship
between the unemployment rate in a state and a state economic policy variable ($X_i$).
The measured unemployment rate ($Y_i$), however, is a survey-based estimate of the
true unemployment rate ($Y_i^*$). Thus $Y_i$ measures $Y_i^*$ with error, where the source of
the error is random survey error, so $Y_i = Y_i^* + v_i$, where $v_i$ is the measurement error
arising from the survey. In this example, it is plausible that the survey sample size, $W_i$,
is not itself a determinant of the true state unemployment rate. Thus the population
regression function does not depend on $W_i$; that is, $E(Y_i^* | X_i, W_i) = \beta_0 + \beta_1 X_i$. We
therefore have the two equations,

$$Y_i^* = \beta_0 + \beta_1 X_i + u_i^* \text{ and} \tag{18.28}$$
$$Y_i = Y_i^* + v_i, \tag{18.29}$$

where Equation (18.28) models the relationship between the state economic policy
variable and the true state unemployment rate and Equation (18.29) represents the
relationship between the measured unemployment rate $Y_i$ and the true unemployment rate $Y_i^*$.

The model in Equations (18.28) and (18.29) can lead to a population regression
in which the conditional variance of the error depends on $W_i$ but not on $X_i$. The error
term $u_i^*$ in Equation (18.28) represents other factors omitted from this regression,
while the error term $v_i$ in Equation (18.29) represents measurement error arising
from the unemployment rate survey. If $u_i^*$ is homoskedastic, then $\mathrm{var}(u_i^* | X_i, W_i) = \sigma_{u^*}^2$
is constant. The survey error variance, however, depends inversely on the survey
sample size $W_i$; that is, $\mathrm{var}(v_i | X_i, W_i) = a/W_i$, where $a$ is a constant. Because $v_i$ is
random survey error, it is safely assumed to be uncorrelated with $u_i^*$, so
$\mathrm{var}(u_i^* + v_i | X_i, W_i) = \sigma_{u^*}^2 + a/W_i$. Thus, substituting Equation (18.28) into
Equation (18.29) leads to the regression model with heteroskedasticity:

$$Y_i = \beta_0 + \beta_1 X_i + u_i, \tag{18.30}$$
$$\mathrm{var}(u_i | X_i, W_i) = \theta_0 + \theta_1 \left(\frac{1}{W_i}\right), \tag{18.31}$$

where $u_i = u_i^* + v_i$, $\theta_0 = \sigma_{u^*}^2$, $\theta_1 = a$, and $E(u_i | X_i, W_i) = 0$.

If $\theta_0$ and $\theta_1$ were known, then the conditional variance function in Equation
(18.31) could be used to estimate $\beta_0$ and $\beta_1$ by WLS. In this example, $\theta_0$ and $\theta_1$ are
unknown, but they can be estimated by regressing the squared OLS residual
[from OLS estimation of Equation (18.30)] on $1/W_i$. Then the estimated conditional
variance function can be used to construct the weights in feasible WLS.

It should be stressed that it is critical that $E(u_i | X_i, W_i) = 0$; if not, the weighted
error will have a nonzero conditional mean, and WLS will be inconsistent. Said
differently, if $W_i$ is, in fact, a determinant of $Y_i$, then Equation (18.30) should be a
multiple regression equation that includes both $X_i$ and $W_i$.

General method of feasible WLS. In general, feasible WLS proceeds in five steps:


1. Regress $Y_i$ on $X_i$ by OLS, and obtain the OLS residuals $\hat{u}_i$, $i = 1, \ldots, n$.


2. Estimate a model of the conditional variance function, $\mathrm{var}(u_i | X_i)$. For example,
if the conditional variance function has the form in Equation (18.27), this entails
regressing $\hat{u}_i^2$ on $X_i^2$. In general, this step entails estimating a function for the
conditional variance, $\mathrm{var}(u_i | X_i)$.


3. Use the estimated function to compute predicted values of the conditional
variance function, $\widehat{\mathrm{var}}(u_i | X_i)$.

4. Weight the dependent variable and regressor (including the intercept) by the 
inverse of the square root of the estimated conditional variance function. 


5. Estimate the coefficients of the weighted regression by OLS; the resulting esti- 
mators are the WLS estimators. 


When the variance of u depends on variables other than X (such as W in example 2), 
steps 2 and 3 are modified accordingly. 

Regression software packages typically include optional weighted least squares 
commands that automate the fourth and fifth of these steps. 
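
The five steps can be sketched in code for the quadratic variance function of Equation (18.27). This is an illustrative sketch, not the routine used by any particular package: the function names, the simulated data, and the small positive floor applied to the predicted variances (a guard added here because OLS-fitted values of the variance function need not be positive) are all assumptions.

```python
import math
import random

def ols(x, y):
    """OLS intercept and slope for the regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def weighted_ols(x, y, var_hat):
    """Steps 4 and 5: weight by 1/sqrt(var_hat), then OLS with no separate intercept."""
    root = [math.sqrt(v) for v in var_hat]
    y_t = [yi / r for yi, r in zip(y, root)]
    x0_t = [1.0 / r for r in root]
    x1_t = [xi / r for xi, r in zip(x, root)]
    s00 = sum(a * a for a in x0_t)
    s01 = sum(a * b for a, b in zip(x0_t, x1_t))
    s11 = sum(b * b for b in x1_t)
    s0y = sum(a * c for a, c in zip(x0_t, y_t))
    s1y = sum(b * c for b, c in zip(x1_t, y_t))
    det = s00 * s11 - s01 ** 2
    return (s11 * s0y - s01 * s1y) / det, (s00 * s1y - s01 * s0y) / det

def feasible_wls(x, y, floor=1e-6):
    """Feasible WLS assuming var(u | X) = theta0 + theta1 * X^2, Equation (18.27)."""
    b0, b1 = ols(x, y)                                              # step 1: OLS residuals
    resid_sq = [(yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)]
    x_sq = [xi ** 2 for xi in x]
    theta0, theta1 = ols(x_sq, resid_sq)                            # step 2: regress uhat^2 on X^2
    # Step 3: predicted variances; the floor is an assumed guard against nonpositive values.
    var_hat = [max(theta0 + theta1 * xs, floor) for xs in x_sq]
    return weighted_ols(x, y, var_hat), (theta0, theta1)            # steps 4 and 5

rng = random.Random(8)
x = [rng.uniform(0.0, 4.0) for _ in range(2000)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, math.sqrt(1.0 + 0.5 * xi ** 2)) for xi in x]
(b0_fwls, b1_fwls), (theta0_hat, theta1_hat) = feasible_wls(x, y)
print(f"feasible WLS: beta0={b0_fwls:.3f}, beta1={b1_fwls:.3f}")
print(f"estimated variance function: theta0={theta0_hat:.3f}, theta1={theta1_hat:.3f}")
```

In this simulated design the true values are $\beta_0 = 1$, $\beta_1 = 2$, $\theta_0 = 1$, and $\theta_1 = 0.5$, so the printed estimates can be compared with them directly.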


Heteroskedasticity-Robust Standard Errors or WLS? 


There are two ways to handle heteroskedasticity: estimating $\beta_0$ and $\beta_1$ by WLS or
estimating $\beta_0$ and $\beta_1$ by OLS and using heteroskedasticity-robust standard errors.
Deciding which approach to use in practice requires weighing the advantages and 
disadvantages of each. 

The advantage of WLS is that it is more efficient than the OLS estimator of the 
coefficients in the original regressors, at least asymptotically. The disadvantage of 
WLS is that it requires knowing the conditional variance function and estimating its 
parameters. If the conditional variance function has the quadratic form in Equation 
(18.27), this is easily done. In practice, however, the functional form of the condi- 
tional variance function is rarely known. Moreover, if the functional form is incorrect, 
then the standard errors computed by WLS regression routines are invalid in the 
sense that they lead to incorrect statistical inferences (tests have the wrong size). 

The advantage of using heteroskedasticity-robust standard errors is that they 
produce asymptotically valid inferences even if you do not know the form of the 
conditional variance function. An additional advantage is that heteroskedasticity- 
robust standard errors are readily computed as an option in modern regression pack- 
ages, so no additional effort is needed to safeguard against this threat. The 
disadvantage of heteroskedasticity-robust standard errors is that the OLS estimator 
will have a larger variance than the WLS estimator (based on the true conditional 
variance function). 

In practice, the functional form of $\mathrm{var}(u_i | X_i)$ is rarely, if ever, known, which
poses a problem for using WLS in real-world applications. This problem is difficult
enough with a single regressor, but in applications with multiple regressors, it is
even more difficult to know the functional form of the conditional variance. For
this reason, practical use of WLS confronts imposing challenges. In contrast, in
modern statistical packages it is simple to use heteroskedasticity-robust standard
errors, and the resulting inferences are reliable under very general conditions; in
particular, heteroskedasticity-robust standard errors can be used without needing
to specify a functional form for the conditional variance. For these reasons, it is our
opinion that, despite the theoretical appeal of WLS, heteroskedasticity-robust
standard errors provide a better way to handle potential heteroskedasticity in most
applications.¹


Summary 


1. The asymptotic normality of the OLS estimator, combined with the consistency 
of heteroskedasticity-robust standard errors, implies that, if the first three least 
squares assumptions in Key Concept 18.1 hold, then the heteroskedasticity- 
robust t-statistic has an asymptotic standard normal distribution under the null 
hypothesis. 

2. If the regression errors are i.i.d. and normally distributed, conditional on the 
regressors, then $\hat{\beta}_1$ has an exact normal sampling distribution, conditional on 
the regressors. In addition, the homoskedasticity-only t-statistic has an exact 
Student $t_{n-2}$ sampling distribution under the null hypothesis. 

3. The weighted least squares (WLS) estimator is OLS applied to a weighted 
regression, where all variables are weighted by the square root of the inverse 
of the conditional variance, $\mathrm{var}(u_i \mid X_i)$, or its estimate. Although the WLS 
estimator is asymptotically more efficient than OLS, to implement WLS you must 
know the functional form of the conditional variance function, which usually 
is a tall order. 


Key Terms 

convergence in probability (690) 
consistent estimator (690) 
convergence in distribution (692) 
asymptotic distribution (693) 
Slutsky’s theorem (693) 
continuous mapping theorem (694) 
weighted least squares (WLS) (699) 
WLS estimator (700) 
infeasible WLS (700) 
feasible WLS (701) 
normal probability density function (p.d.f.) (710) 
bivariate normal p.d.f. (710) 


¹This chapter has focused on the case of a single treatment effect, $\beta_1$. Heterogeneous treatment effects 
introduce additional complications for WLS. Suppose that the treatment X is randomly assigned and 
the observations (experimental units) are randomly drawn from the population (assumption 2 in 
Key Concept 18.1). Then OLS is a consistent estimator of the average causal effect, but WLS is not 
(Exercise 18.13). 

Review the Concepts 

18.1 Suppose that assumption 4 in Key Concept 18.1 is true but you construct a 95% 
confidence interval for $\beta_1$ using the heteroskedasticity-robust standard error in a 
large sample. Would this confidence interval be valid asymptotically in the sense 
that it contained the true value of $\beta_1$ in 95% of all repeated samples for large n? 
Suppose instead that assumption 4 in Key Concept 18.1 is false but you construct 
a 95% confidence interval for $\beta_1$ using the homoskedasticity-only standard error 
formula in a large sample. Would this confidence interval be valid asymptotically? 

18.2 Suppose that $A_n$ is a sequence of random variables that converges in 
probability to 3. Suppose that $B_n$ is a sequence of random variables that converges 
in distribution to a standard normal. What is the asymptotic distribution of 
$A_n B_n$? Use this asymptotic distribution to compute an approximate value of 
$\Pr(A_n B_n < 2)$. 

18.3 Suppose that Y and X are related by the regression $Y = 1.0 + 2.0X + u$. 
A researcher has observations on Y and X, where $0 \le X \le 20$, where 
the conditional variance is $\mathrm{var}(u_i \mid X_i = x) = 1$ for $0 \le x \le 10$ and 
$\mathrm{var}(u_i \mid X_i = x) = 16$ for $10 < x \le 20$. Draw a hypothetical scatterplot of the 
observations $(X_i, Y_i)$, $i = 1, \ldots, n$. Does WLS put more weight on observations 
with $x \le 10$ or $x > 10$? Why? 

18.4 Instead of using WLS, the researcher in the previous problem decides to 
compute the OLS estimator using only the observations for which $x \le 10$, then 
using only the observations for which $x > 10$, and then using the average of the 
two OLS estimators. Is this estimator more efficient than WLS? 


Exercises 


18.1 Consider the regression model without an intercept term, $Y_i = \beta_1 X_i + u_i$ (so 
the true value of the intercept, $\beta_0$, is 0). 

a. Derive the least squares estimator of $\beta_1$ for the restricted regression model 
$Y_i = \beta_1 X_i + u_i$. This is called the restricted least squares estimator ($\hat{\beta}_1^{RLS}$) of 
$\beta_1$ because it is estimated under a restriction, which in this case is $\beta_0 = 0$. 

b. Derive the asymptotic distribution of $\hat{\beta}_1^{RLS}$ under assumptions 1 through 
3 of Key Concept 18.1. 

c. Show that $\hat{\beta}_1^{RLS}$ is linear [Equation (5.24)] and, under assumptions 1 and 
2 of Key Concept 18.1, conditionally unbiased [Equation (5.25)]. 

d. Derive the conditional variance of $\hat{\beta}_1^{RLS}$ under the Gauss-Markov 
conditions (assumptions 1 through 4 of Key Concept 18.1). 

e. Compare the conditional variance of $\hat{\beta}_1^{RLS}$ in (d) to the conditional 
variance of the OLS estimator $\hat{\beta}_1$ (from the regression including an intercept) 
under the Gauss-Markov conditions. Which estimator is more 
efficient? Use the formulas for the variances to explain why. 

f. Derive the exact sampling distribution of $\hat{\beta}_1^{RLS}$ under assumptions 1 
through 5 of Key Concept 18.1. 

g. Now consider the estimator $\tilde{\beta}_1 = \sum_{i=1}^n Y_i \big/ \sum_{i=1}^n X_i$. Derive an expression 
for $\mathrm{var}(\tilde{\beta}_1 \mid X_1, \ldots, X_n) - \mathrm{var}(\hat{\beta}_1^{RLS} \mid X_1, \ldots, X_n)$ under the 
Gauss-Markov conditions, and use this expression to show that 
$\mathrm{var}(\tilde{\beta}_1 \mid X_1, \ldots, X_n) \ge \mathrm{var}(\hat{\beta}_1^{RLS} \mid X_1, \ldots, X_n)$. 

18.2 Suppose that $(X_i, Y_i)$ are i.i.d. with finite fourth moments. Prove that the 
sample covariance is a consistent estimator of the population covariance, 
that is, that $s_{XY} \xrightarrow{p} \sigma_{XY}$, where $s_{XY}$ is defined in Equation (3.24). 
(Hint: Use the strategy outlined in Appendix 3.3 and the Cauchy-Schwarz 
inequality.) 

18.3 This exercise fills in the details of the derivation of the asymptotic distribution 
of $\hat{\beta}_1$ given in Appendix 4.3. 

a. Use Equation (18.19) to derive the expression 

$$\sqrt{n}(\hat{\beta}_1 - \beta_1) = \frac{\sqrt{\frac{1}{n}}\sum_{i=1}^n v_i}{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2} - \frac{(\bar{X} - \mu_X)\sqrt{\frac{1}{n}}\sum_{i=1}^n u_i}{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2},$$

where $v_i = (X_i - \mu_X)u_i$. 

b. Use the central limit theorem, the law of large numbers, and Slutsky’s 
theorem to show that the final term in the equation converges in 
probability to 0. 

c. Use the Cauchy-Schwarz inequality and the third least squares assumption 
in Key Concept 18.1 to prove that $\mathrm{var}(v_i) < \infty$. Does the term 
$\sqrt{\frac{1}{n}}\sum_{i=1}^n v_i / \sigma_v$ satisfy the central limit theorem? 

d. Apply the central limit theorem and Slutsky’s theorem to obtain the 
result in Equation (18.12). 


18.4 Show the following results: 

a. Show that $\sqrt{n}(\hat{\beta}_1 - \beta_1) \xrightarrow{d} N(0, a^2)$, where $a^2$ is a constant, implies 
that $\hat{\beta}_1$ is consistent. (Hint: Use Slutsky’s theorem.) 

b. Show that $s_u^2/\sigma_u^2 \xrightarrow{p} 1$ implies that $s_u/\sigma_u \xrightarrow{p} 1$. 

18.5 Suppose that W is a random variable with $E(W^4) < \infty$. Show that $E(W^2) < \infty$. 

18.6 Show that if $\hat{\beta}_1$ is conditionally unbiased, then it is unbiased; that is, show that 
if $E(\hat{\beta}_1 \mid X_1, \ldots, X_n) = \beta_1$, then $E(\hat{\beta}_1) = \beta_1$. 

18.7 Suppose that X and u are continuous random variables and $(X_i, u_i)$, $i = 1, \ldots, n$, 
are i.i.d. 

a. Show that the joint probability density function (p.d.f.) of $(u_i, u_j, X_i, X_j)$ 
can be written as $f(u_i, X_i) f(u_j, X_j)$ for $i \ne j$, where $f(u_i, X_i)$ is the joint 
p.d.f. of $u_i$ and $X_i$. 

b. Show that $E(u_i u_j \mid X_i, X_j) = E(u_i \mid X_i) E(u_j \mid X_j)$ for $i \ne j$. 

c. Show that $E(u_i \mid X_1, \ldots, X_n) = E(u_i \mid X_i)$. 

d. Show that $E(u_i u_j \mid X_1, X_2, \ldots, X_n) = E(u_i \mid X_i) E(u_j \mid X_j)$ for $i \ne j$. 

18.8 Consider the regression model in Key Concept 18.1, and suppose that 
assumptions 1, 2, 3, and 5 hold. Suppose that assumption 4 is replaced by the 
assumption that $\mathrm{var}(u_i \mid X_i) = \theta_0 + \theta_1 |X_i|$, where $|X_i|$ is the absolute value of 
$X_i$, $\theta_0 > 0$, and $\theta_1 \ge 0$. 

a. Is the OLS estimator of $\beta_1$ BLUE? 

b. Suppose that $\theta_0$ and $\theta_1$ are known. What is the BLUE estimator of $\beta_1$? 

c. Derive the exact sampling distribution of the OLS estimator, $\hat{\beta}_1$, 
conditional on $X_1, \ldots, X_n$. 

d. Derive the exact sampling distribution of the WLS estimator (treating $\theta_0$ 
and $\theta_1$ as known) of $\beta_1$, conditional on $X_1, \ldots, X_n$. 

18.9 Prove Equation (18.16) under assumptions 1 and 2 of Key Concept 18.1 plus 
the assumption that $X_i$ and $u_i$ have eight moments. 

18.10 Let $\hat{\theta}$ be an estimator of the parameter $\theta$, where $\hat{\theta}$ might be biased. Show that 
if $E[(\hat{\theta} - \theta)^2] \to 0$ as $n \to \infty$ (that is, if the mean squared error of 
$\hat{\theta}$ tends to 0), then $\hat{\theta} \xrightarrow{p} \theta$. [Hint: Use Equation (18.43) with $W = \hat{\theta} - \theta$.] 

18.11 Suppose that X and Y are distributed bivariate normal with the density given 
in Equation (18.38). 

a. Show that the density of Y given X = x can be written as 

$$f_{Y|X=x}(y) = \frac{1}{\sigma_{Y|X}\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y - \mu_{Y|X}}{\sigma_{Y|X}}\right)^2\right],$$

where $\sigma_{Y|X} = \sqrt{\sigma_Y^2(1 - \rho_{XY}^2)}$ and $\mu_{Y|X} = \mu_Y + (\sigma_{XY}/\sigma_X^2)(x - \mu_X)$. 
[Hint: Use the definition of the conditional probability density 
$f_{Y|X=x}(y) = g_{X,Y}(x, y)/f_X(x)$, where $g_{X,Y}$ is the joint density of X and Y 
and $f_X$ is the marginal density of X.] 

b. Use the result in (a) to show that $Y \mid X = x \sim N(\mu_{Y|X}, \sigma^2_{Y|X})$. 

c. Use the result in (b) to show that $E(Y \mid X = x) = a + bx$ for suitably 
chosen constants a and b. 

18.12 a. Suppose that $u \sim N(0, \sigma_u^2)$. Show that $E(e^u) = e^{\sigma_u^2/2}$. 

b. Suppose that the conditional distribution of u given X = x is $N(0, a + bx^2)$, 
where a and b are positive constants. Show that $E(e^u \mid X = x) = e^{(a + bx^2)/2}$. 


18.13 Consider the heterogeneous regression model $Y_i = \beta_{0i} + \beta_{1i}X_i + u_i$, where 
$\beta_{0i}$ and $\beta_{1i}$ are random variables that differ from one observation to the 
next. Suppose that $E(u_i \mid X_i) = 0$ and $(\beta_{0i}, \beta_{1i})$ are distributed independently 
of $X_i$ and that the observational units are randomly drawn from the 
population. 

a. Let $\hat{\beta}_1^{OLS}$ denote the OLS estimator of $\beta_1$ given in Equation (18.2). Show 
that $\hat{\beta}_1^{OLS} \xrightarrow{p} E(\beta_1)$, where $E(\beta_1)$ is the average value of $\beta_{1i}$ in the 
population. [Hint: See Equation (13.10).] 

b. Suppose that $\mathrm{var}(u_i \mid X_i) = \theta_0 + \theta_1 X_i^2$, where $\theta_0$ and $\theta_1$ are known positive 
constants. Let $\hat{\beta}_1^{WLS}$ denote the weighted least squares estimator. 
Does $\hat{\beta}_1^{WLS} \xrightarrow{p} E(\beta_1)$? Explain. 

18.14 Suppose that $Y_i$, $i = 1, 2, \ldots, n$, are i.i.d. with $E(Y_i) = \mu$, $\mathrm{var}(Y_i) = \sigma^2$, and 
finite fourth moments. Show the following: 

a. $E(Y_i^2) = \mu^2 + \sigma^2$. 

b. $\bar{Y} \xrightarrow{p} \mu$. 

c. $\frac{1}{n}\sum_{i=1}^n Y_i^2 \xrightarrow{p} \mu^2 + \sigma^2$. 

d. $\frac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})^2 = \frac{1}{n}\sum_{i=1}^n Y_i^2 - \bar{Y}^2$. 

e. $\frac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})^2 \xrightarrow{p} \sigma^2$. 

18.15 Z is distributed $N(0, 1)$, W is distributed $\chi^2_n$, and V is distributed $\chi^2_m$. Show, as 
$n \to \infty$ and m is fixed, that 

a. $W/n \xrightarrow{p} 1$. 

b. $\dfrac{Z}{\sqrt{W/n}} \xrightarrow{d} N(0, 1)$. Use the result to explain why the $t_\infty$ distribution is 
the same as the standard normal distribution. 

c. $\dfrac{V/m}{W/n} \xrightarrow{d} \chi^2_m/m$. Use the result to explain why the $F_{m,\infty}$ distribution is 
the same as the $\chi^2_m/m$ distribution. 


The Normal and Related Distributions and 
Moments of Continuous Random Variables 


This appendix defines and discusses the normal and related distributions. The definitions of 
the chi-squared, F, and Student t distributions, given in Section 2.4, are restated here for 
convenient reference. We begin by presenting definitions of probabilities and moments involving 
continuous random variables. 


Probabilities and Moments of Continuous 
Random Variables 


As discussed in Section 2.1, if Y is a continuous random variable, then its probability is sum- 
marized by its probability density function (p.d.f.). The probability that Y falls between two 
values is the area under its p.d.f. between those two values. Because Y is continuous, however, 
the mathematical expressions for its probabilities involve integrals rather than the summations 
that are appropriate for discrete random variables. 

Let $f_Y$ denote the probability density function of Y. Because probabilities cannot be negative, 
$f_Y(y) \ge 0$ for all y. The probability that Y falls between a and b (where a < b) is 

$$\Pr(a \le Y \le b) = \int_a^b f_Y(y)\,dy. \tag{18.32}$$

Because Y must take on some value on the real line, $\Pr(-\infty \le Y \le \infty) = 1$, which implies 
that $\int_{-\infty}^{\infty} f_Y(y)\,dy = 1$. 

Expected values and moments of continuous random variables, like those of discrete 
random variables, are probability-weighted averages of their values except that summations 
[for example, the summation in Equation (2.3)] are replaced by integrals. Accordingly, the 
expected value of Y is 

$$E(Y) = \mu_Y = \int y f_Y(y)\,dy, \tag{18.33}$$

where the range of integration is the set of values for which $f_Y$ is nonzero. The variance is the 
expected value of $(Y - \mu_Y)^2$, the $r$th moment of a random variable is the expected value of $Y^r$, 
and the $r$th central moment is the expected value of $(Y - \mu_Y)^r$. Thus 

$$\mathrm{var}(Y) = E(Y - \mu_Y)^2 = \int (y - \mu_Y)^2 f_Y(y)\,dy, \tag{18.34}$$

$$E(Y^r) = \int y^r f_Y(y)\,dy, \tag{18.35}$$

and similarly for the $r$th central moment, $E(Y - \mu_Y)^r$. 


The Normal Distribution 


The normal distribution for a single variable. The probability density function of a normally 
distributed random variable (the normal probability density function (p.d.f.)) is 

$$f_Y(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right], \tag{18.36}$$

where exp(x) is the exponential function of x. The factor $1/(\sigma\sqrt{2\pi})$ in Equation (18.36) 
ensures that $\Pr(-\infty \le Y \le \infty) = \int_{-\infty}^{\infty} f_Y(y)\,dy = 1$. 

The mean of the normal distribution is $\mu$, and its variance is $\sigma^2$. The normal distribution is 
symmetric, so all odd central moments of order three and greater are 0. The fourth central moment 
is $3\sigma^4$. In general, if Y is distributed $N(\mu, \sigma^2)$, then its even central moments are given by 

$$E(Y - \mu)^k = \frac{k!}{2^{k/2}(k/2)!}\,\sigma^k \quad (k \text{ even}). \tag{18.37}$$

When $\mu = 0$ and $\sigma^2 = 1$, the normal distribution is called the standard normal distribution. 
The standard normal p.d.f. is denoted $\phi$, and the standard normal cumulative distribution 
function (c.d.f.) is denoted $\Phi$. Thus the standard normal density is 
$\phi(y) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right)$ and $\Phi(y) = \int_{-\infty}^{y} \phi(s)\,ds$. 
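The formula for the even central moments in Equation (18.37) is easy to evaluate directly. The short sketch below is only an illustration; the function name `normal_even_central_moment` is hypothetical. With $\sigma = 1$ it returns 1, 3, and 15 for k = 2, 4, and 6, which agrees with the variance $\sigma^2$ and the fourth central moment $3\sigma^4$ stated above.

```python
from math import factorial

def normal_even_central_moment(k, sigma):
    """E[(Y - mu)^k] for Y ~ N(mu, sigma^2) and even k, following Equation (18.37)."""
    if k % 2 != 0:
        raise ValueError("k must be even; the odd central moments are 0")
    return factorial(k) / (2 ** (k // 2) * factorial(k // 2)) * sigma ** k

for k in (2, 4, 6):
    print(k, normal_even_central_moment(k, sigma=1.0))
```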

The bivariate normal distribution. The bivariate normal p.d.f. for the two random variables 
X and Y is 

$$g_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho_{XY}^2}} \exp\left\{\frac{1}{-2(1 - \rho_{XY}^2)}\left[\left(\frac{x - \mu_X}{\sigma_X}\right)^2 - 2\rho_{XY}\left(\frac{x - \mu_X}{\sigma_X}\right)\left(\frac{y - \mu_Y}{\sigma_Y}\right) + \left(\frac{y - \mu_Y}{\sigma_Y}\right)^2\right]\right\}, \tag{18.38}$$

where $\rho_{XY}$ is the correlation between X and Y. 

When X and Y are uncorrelated ($\rho_{XY} = 0$), $g_{X,Y}(x, y) = f_X(x)f_Y(y)$, where f is the normal 
density given in Equation (18.36). This proves that if X and Y are jointly normally distributed 
and are uncorrelated, then they are independently distributed. This is a special feature of the 
normal distribution that is typically not true for other distributions. 

The multivariate normal distribution extends the bivariate normal distribution to handle 
more than two random variables. This distribution is most conveniently stated using matrices 
and is presented in Appendix 19.1. 


The conditional normal distribution. Suppose that X and Y are jointly normally distributed. 
Then the conditional distribution of Y given X is $N(\mu_{Y|X}, \sigma^2_{Y|X})$, with mean 
$\mu_{Y|X} = \mu_Y + (\sigma_{XY}/\sigma_X^2)(X - \mu_X)$ and variance $\sigma^2_{Y|X} = (1 - \rho_{XY}^2)\sigma_Y^2$. The mean of this 
conditional distribution, conditional on X = x, is a linear function of x, and the variance does not 
depend on x (Exercise 18.11). 
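The two formulas for the conditional mean and variance can be written as a small function. This is a sketch under assumed inputs: the function name `conditional_normal` and the parameter values are hypothetical, and the covariance is obtained from the correlation through $\sigma_{XY} = \rho_{XY}\sigma_X\sigma_Y$.

```python
def conditional_normal(mu_x, mu_y, sigma_x, sigma_y, rho, x):
    """Mean and variance of Y given X = x when (X, Y) is bivariate normal."""
    sigma_xy = rho * sigma_x * sigma_y                   # covariance of X and Y
    mean = mu_y + (sigma_xy / sigma_x**2) * (x - mu_x)   # linear in x
    var = (1.0 - rho**2) * sigma_y**2                    # does not depend on x
    return mean, var

mean, var = conditional_normal(mu_x=1.0, mu_y=2.0, sigma_x=2.0, sigma_y=3.0, rho=0.5, x=3.0)
print(mean, var)
```

With these illustrative values the conditional mean is 3.5 and the conditional variance is 6.75; changing `x` shifts the mean along a straight line and leaves the variance unchanged.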


Related Distributions 


The chi-squared distribution. Let $Z_1, Z_2, \ldots, Z_n$ be n i.i.d. standard normal random variables. 
The random variable 

$$W = \sum_{i=1}^n Z_i^2 \tag{18.39}$$

has a chi-squared distribution with n degrees of freedom. This distribution is denoted $\chi^2_n$. 
Because $E(Z_i^2) = 1$ and $E(Z_i^4) = 3$, $E(W) = n$ and $\mathrm{var}(W) = 2n$. 

The Student t distribution. Let Z have a standard normal distribution, let W have a $\chi^2_m$ 
distribution, and let Z and W be independently distributed. Then the random variable 

$$t = \frac{Z}{\sqrt{W/m}} \tag{18.40}$$

has a Student t distribution with m degrees of freedom, denoted $t_m$. The $t_\infty$ distribution is the 
standard normal distribution. (See Exercise 18.15.) 

The F distribution. Let $W_1$ and $W_2$ be independent random variables with chi-squared 
distributions with respective degrees of freedom $n_1$ and $n_2$. Then the random variable 

$$F = \frac{W_1/n_1}{W_2/n_2} \tag{18.41}$$

has an F distribution with $(n_1, n_2)$ degrees of freedom. This distribution is denoted $F_{n_1,n_2}$. 

The F distribution depends on the numerator degrees of freedom $n_1$ and the denominator 
degrees of freedom $n_2$. As the number of degrees of freedom in the denominator gets large, the 
$F_{n_1,n_2}$ distribution is well approximated by a $\chi^2_{n_1}$ distribution, divided by $n_1$. In the limit, the 
$F_{n_1,\infty}$ distribution is the same as the $\chi^2_{n_1}$ distribution, divided by $n_1$; that is, it is the same as 
the $\chi^2_{n_1}/n_1$ distribution. (See Exercise 18.15.) 
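The three definitions can be mirrored in a short simulation that builds each random variable from standard normal draws. This is only a sketch: the seed, the number of replications, and the degrees of freedom are arbitrary choices made for illustration, and the sample moments only approximate the population values.

```python
import numpy as np

rng = np.random.default_rng(1)
reps, n, m = 100_000, 5, 3

# Chi-squared with n degrees of freedom: the sum of n squared i.i.d. standard normals.
W = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)
print(W.mean(), W.var())          # close to n and 2n

# Student t with n degrees of freedom: Z / sqrt(W / n), with Z drawn independently of W.
Z = rng.standard_normal(reps)
t = Z / np.sqrt(W / n)

# F with (m, n) degrees of freedom: ratio of independent chi-squared variables,
# each divided by its degrees of freedom.
V = (rng.standard_normal((reps, m)) ** 2).sum(axis=1)
F = (V / m) / (W / n)
print(np.median(np.abs(t)), np.median(F))
```

Increasing the denominator degrees of freedom in the construction of `t` and `F` makes `W / n` concentrate near 1, which is the mechanism behind the limiting results cited from Exercise 18.15.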


Two Inequalities 


This appendix states and proves Chebychev’s inequality and the Cauchy—Schwarz inequality. 


Chebychev's Inequality 


Chebychev’s inequality uses the variance of the random variable V to bound the probability 
that V is farther than $\pm\delta$ from its mean, where $\delta$ is a positive constant: 

$$\Pr(|V - \mu_V| \ge \delta) \le \frac{\mathrm{var}(V)}{\delta^2} \quad \text{(Chebychev's inequality)}. \tag{18.42}$$




To prove Equation (18.42), let $W = V - \mu_V$, let f be the p.d.f. of W, and let $\delta$ be any positive 
number. Now 

$$\begin{aligned}
E(W^2) &= \int_{-\infty}^{\infty} w^2 f(w)\,dw \\
&= \int_{-\infty}^{-\delta} w^2 f(w)\,dw + \int_{-\delta}^{\delta} w^2 f(w)\,dw + \int_{\delta}^{\infty} w^2 f(w)\,dw \\
&\ge \int_{-\infty}^{-\delta} w^2 f(w)\,dw + \int_{\delta}^{\infty} w^2 f(w)\,dw \\
&\ge \delta^2\left[\int_{-\infty}^{-\delta} f(w)\,dw + \int_{\delta}^{\infty} f(w)\,dw\right] \\
&= \delta^2 \Pr(|W| \ge \delta),
\end{aligned} \tag{18.43}$$

where the first equality is the definition of $E(W^2)$, the second equality holds because the 
ranges of integration divide up the real line, the first inequality holds because the term that 
was dropped is nonnegative, the second inequality holds because $w^2 \ge \delta^2$ over the range of 
integration, and the final equality holds by the definition of $\Pr(|W| \ge \delta)$. Substituting 
$W = V - \mu_V$ into the final expression, noting that $E(W^2) = E[(V - \mu_V)^2] = \mathrm{var}(V)$, and 
rearranging yields the inequality given in Equation (18.42). If V is discrete, this proof applies 
with summations replacing integrals. 
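Chebychev's inequality can be checked numerically on any sample, because the discrete version of the proof applies to the empirical distribution of the draws. The sketch below is illustrative only: the exponential distribution, the seed, and the values of $\delta$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.exponential(scale=1.0, size=100_000)   # any distribution with finite variance would do
mu_V, var_V = V.mean(), V.var()                # moments of the empirical distribution

for delta in (1.0, 2.0, 3.0):
    tail_prob = np.mean(np.abs(V - mu_V) >= delta)   # empirical Pr(|V - mu_V| >= delta)
    bound = var_V / delta**2                         # right-hand side of Equation (18.42)
    print(delta, tail_prob, bound)
```

The bound is typically far from tight: it holds for every distribution with a finite variance, so it cannot exploit the shape of any particular one.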


The Cauchy-Schwarz Inequality 


The Cauchy-Schwarz inequality is an extension of the correlation inequality, $|\rho_{XY}| \le 1$, to 
incorporate nonzero means. The Cauchy-Schwarz inequality is 

$$|E(XY)| \le \sqrt{E(X^2)E(Y^2)} \quad \text{(Cauchy-Schwarz inequality)}. \tag{18.44}$$

The proof of Equation (18.44) is similar to the proof of the correlation inequality in Appendix 
2.1. Let $W = Y + bX$, where b is a constant. Then $E(W^2) = E(Y^2) + 2bE(XY) + b^2E(X^2)$. 
Now let $b = -E(XY)/E(X^2)$, so that (after simplification) the expression becomes 
$E(W^2) = E(Y^2) - [E(XY)]^2/E(X^2)$. Because $E(W^2) \ge 0$ (since $W^2 \ge 0$), it must be the 
case that $[E(XY)]^2 \le E(X^2)E(Y^2)$, and the Cauchy-Schwarz inequality follows by taking the 
square root. 
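The steps of this proof can be traced with sample moments in place of expectations, since the algebra holds for any distribution, including the empirical one. In the sketch below the data-generating process, the seed, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(1.0, 2.0, size=50_000)
Y = 0.5 * X + rng.normal(-1.0, 1.0, size=50_000)

lhs = abs(np.mean(X * Y))                          # |E(XY)|
rhs = np.sqrt(np.mean(X**2) * np.mean(Y**2))       # sqrt(E(X^2) E(Y^2))

b = -np.mean(X * Y) / np.mean(X**2)                # the constant b chosen in the proof
W = Y + b * X
second_moment_W = np.mean(W**2)                    # E(W^2), which cannot be negative
simplified = np.mean(Y**2) - np.mean(X * Y) ** 2 / np.mean(X**2)
print(lhs <= rhs, second_moment_W, simplified)
```

The last two printed numbers agree up to rounding error, which is the simplification step in the proof, and their nonnegativity is what delivers the inequality.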


CHAPTER 19 The Theory of Multiple Regression 


This chapter provides an introduction to the theory of multiple regression analysis. 
The chapter has four objectives. The first is to present the multiple regression 
model in matrix form, which leads to compact formulas for the ordinary least squares 
(OLS) estimator and test statistics. The second objective is to characterize the sampling 
distribution of the OLS estimator, both in large samples (using asymptotic theory) and 
in small samples (if the errors are homoskedastic and normally distributed). The third 
objective is to study the theory of efficient estimation of the coefficients of the 
multiple regression model and to describe generalized least squares (GLS), a method 
for estimating the regression coefficients efficiently when the errors are heteroskedastic 
and/or correlated across observations. The fourth objective is to provide a concise 
treatment of the asymptotic distribution theory of instrumental variables (IV) regression 
in the linear model, including an introduction to generalized method of moments 
(GMM) estimation in the linear IV regression model with heteroskedastic errors. 

The chapter begins by laying out the multiple regression model and the OLS 
estimator in matrix form in Section 19.1. This section also presents the extended least 
squares assumptions for the multiple regression model. The first four of these 
assumptions are the same as the least squares assumptions of Key Concept 6.4 and 
underlie the asymptotic distributions used to justify the procedures described in 
Chapters 6 and 7. The remaining two extended least squares assumptions are stronger 
and permit us to explore in more detail the theoretical properties of the OLS estimator 
in the multiple regression model. 

The next three sections examine the sampling distribution of the OLS estimator and 
test statistics. Section 19.2 presents the asymptotic distributions of the OLS estimator and 
t-statistic under the least squares assumptions of Key Concept 6.4. Section 19.3 unifies 
and generalizes the tests of hypotheses involving multiple coefficients presented in 
Sections 7.2 and 7.3 and provides the asymptotic distribution of the resulting F-statistic. 
In Section 19.4, we examine the exact sampling distributions of the OLS estimator and 
test statistics in the special case that the errors are homoskedastic and normally 
distributed. Although the assumption of homoskedastic normal errors is implausible 
in most econometric applications, the exact sampling distributions are of theoretical 
interest, and p-values computed using these distributions often appear in the output 
of regression software. 

The next two sections turn to the theory of efficient estimation of the coefficients 
of the multiple regression model. Section 19.5 generalizes the Gauss-Markov theorem 
to multiple regression. Section 19.6 develops the method of generalized least squares 
(GLS). 

The final section takes up IV estimation in the general IV regression model when 
the instruments are valid and strong. This section derives the asymptotic distribution 
of the two stage least squares (TSLS) estimator when the errors are heteroskedastic 
and provides expressions for the standard error of the TSLS estimator. The TSLS 
estimator is one of many possible GMM estimators, and this section provides an 
introduction to GMM estimation in the linear IV regression model. It is shown that the 
TSLS estimator is the efficient GMM estimator if the errors are homoskedastic. 


Mathematical prerequisite. The treatment of the linear model in this chapter uses 
matrix notation and the basic tools of linear algebra and assumes that the reader has 
taken an introductory course in linear algebra. Appendix 19.1 reviews vectors, matrices, 
and the matrix operations used in this chapter. In addition, multivariate calculus is used 
in Section 19.1 to derive the OLS estimator. 


19.1 The Linear Multiple Regression Model 
and OLS Estimator in Matrix Form 


The linear multiple regression model and the OLS estimator can each be represented 
compactly using matrix notation. 


The Multiple Regression Model in Matrix Notation 
The population multiple regression model (Key Concept 6.2) is 
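$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i, \quad i = 1, \ldots, n. \tag{19.1}$$

To write the multiple regression model in matrix form, define the following vectors 
and matrices: 

$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix},\quad \mathbf{U} = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix},\quad \mathbf{X} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1n} & \cdots & X_{kn} \end{pmatrix} = \begin{pmatrix} \mathbf{X}_1' \\ \mathbf{X}_2' \\ \vdots \\ \mathbf{X}_n' \end{pmatrix},\quad \text{and } \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \tag{19.2}$$

so $\mathbf{Y}$ is $n \times 1$, $\mathbf{X}$ is $n \times (k + 1)$, $\mathbf{U}$ is $n \times 1$, and $\boldsymbol{\beta}$ is $(k + 1) \times 1$. Throughout we 
denote matrices and vectors by bold type. In this notation, 

• $\mathbf{Y}$ is the $n \times 1$ dimensional vector of n observations on the dependent variable. 

• $\mathbf{X}$ is the $n \times (k + 1)$ dimensional matrix of n observations on the k + 1 regressors 
(including the “constant” regressor for the intercept). 

• The $(k + 1) \times 1$ dimensional column vector $\mathbf{X}_i$ is the $i$th observation on the k + 1 
regressors; that is, $\mathbf{X}_i' = (1 \; X_{1i} \; \cdots \; X_{ki})$, where $\mathbf{X}_i'$ denotes the transpose of $\mathbf{X}_i$. 

• $\mathbf{U}$ is the $n \times 1$ dimensional vector of the n error terms. 

• $\boldsymbol{\beta}$ is the $(k + 1) \times 1$ dimensional vector of the k + 1 unknown regression 
coefficients. 


KEY CONCEPT 19.1 
The Extended Least Squares Assumptions 
in the Multiple Regression Model 

The linear regression model with multiple regressors is 

$$Y_i = \mathbf{X}_i'\boldsymbol{\beta} + u_i, \quad i = 1, \ldots, n, \tag{19.3}$$

where $\boldsymbol{\beta}$ is the vector of causal effects and 

1. $E(u_i \mid \mathbf{X}_i) = 0$ ($u_i$ has conditional mean 0); 

2. $(\mathbf{X}_i, Y_i)$, $i = 1, \ldots, n$, are independently and identically distributed (i.i.d.) 
draws from their joint distribution; 

3. $\mathbf{X}_i$ and $u_i$ have nonzero finite fourth moments; 

4. $\mathbf{X}$ has full column rank (there is no perfect multicollinearity); 

5. $\mathrm{var}(u_i \mid \mathbf{X}_i) = \sigma_u^2$ (homoskedasticity); and 

6. The conditional distribution of $u_i$ given $\mathbf{X}_i$ is normal (normal errors). 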


The multiple regression model in Equation (19.1) for the $i$th observation, written 
using the vectors $\boldsymbol{\beta}$ and $\mathbf{X}_i$, is 

$$Y_i = \mathbf{X}_i'\boldsymbol{\beta} + u_i, \quad i = 1, \ldots, n. \tag{19.4}$$

In Equation (19.4), the first regressor is the “constant” regressor that always equals 1, 
and its coefficient is the intercept. Thus the intercept does not appear separately in 
Equation (19.4); rather, it is the first element of the coefficient vector $\boldsymbol{\beta}$. 

Stacking all observations in Equation (19.4) yields the multiple regression 
model in matrix form: 

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{U}. \tag{19.5}$$
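The stacking step translates directly into array code. The following sketch uses made-up data; the dimensions, the seed, and the coefficient values are assumptions for illustration only. It builds the regressor matrix with the constant in its first column and confirms that row i of the stacked model reproduces Equation (19.4).

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2
regressors = rng.normal(size=(n, k))             # hypothetical data on X_1i and X_2i
X = np.column_stack([np.ones(n), regressors])    # n x (k + 1); first column is the constant
beta = np.array([1.0, 0.5, -2.0])                # k + 1 coefficients, intercept first
U = rng.normal(size=n)                           # n error terms

Y = X @ beta + U                                 # Equation (19.5)

# Row i of X is the transpose of X_i, so observation i satisfies Equation (19.4).
i = 3
print(np.isclose(Y[i], X[i] @ beta + U[i]))
print(X.shape, beta.shape, Y.shape)
```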


The Extended Least Squares Assumptions 


The extended least squares assumptions for the multiple regression model are the four 
least squares assumptions for causal inference in the multiple regression model in Key 
Concept 6.4 plus the two additional assumptions of homoskedasticity and normally 
distributed errors. The assumption of homoskedasticity is used when we study the effi- 
ciency of the OLS estimator, and the assumption of normality is used when we study 
the exact sampling distribution of the OLS estimator and test statistics. 

The extended least squares assumptions are summarized in Key Concept 19.1. 
Except for notational differences, the first three assumptions in Key Concept 19.1 
are identical to the first three assumptions in Key Concept 6.4. 

The fourth assumptions in Key Concepts 6.4 and 19.1 might appear different, but, in 
fact, they are the same: They are simply different ways of saying that there cannot be 
perfect multicollinearity. Recall that perfect multicollinearity arises when one regressor 
can be written as a perfect linear combination of the others. In the matrix notation of 
Equation (19.2), perfect multicollinearity means that one column of X is a perfect linear 
combination of the other columns of X, but if this is true, then X does not have full column 
rank. Thus saying that X has rank k + 1—that is, rank equal to the number of columns 
of X—is just another way to say that the regressors are not perfectly multicollinear. 
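The equivalence between full column rank and the absence of perfect multicollinearity can be seen in a small numerical example. The matrices below are hypothetical; `np.linalg.matrix_rank` computes the rank numerically from a singular value decomposition.

```python
import numpy as np

n = 5
x1 = np.arange(1.0, n + 1)
x2 = 2.0 * x1 + 3.0                       # exact linear combination of the constant and x1
X_bad = np.column_stack([np.ones(n), x1, x2])
X_ok = np.column_stack([np.ones(n), x1, x1**2])

print(np.linalg.matrix_rank(X_bad))       # 2, fewer than its 3 columns: perfect multicollinearity
print(np.linalg.matrix_rank(X_ok))        # 3: full column rank
```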

The fifth least squares assumption in Key Concept 19.1 is that the error term is 
conditionally homoskedastic, and the sixth assumption is that the conditional 
distribution of $u_i$ given $\mathbf{X}_i$ is normal. These two assumptions are the same as the final two 
assumptions in Key Concept 18.1 except that they are now stated for multiple 
regressors. 


Implications for the mean vector and covariance matrix of U. The least squares 
assumptions in Key Concept 19.1 imply simple expressions for the mean vector and 
covariance matrix of the conditional distribution of $\mathbf{U}$ given the matrix of regressors 
$\mathbf{X}$. (The mean vector and covariance matrix of a vector of random variables are 
defined in Appendix 19.2.) Specifically, the first and second assumptions in Key 
Concept 19.1 imply that $E(u_i \mid \mathbf{X}) = E(u_i \mid \mathbf{X}_i) = 0$ and that 
$\mathrm{cov}(u_i, u_j \mid \mathbf{X}) = E(u_i u_j \mid \mathbf{X}) = E(u_i u_j \mid \mathbf{X}_i, \mathbf{X}_j) = E(u_i \mid \mathbf{X}_i)E(u_j \mid \mathbf{X}_j) = 0$ 
for $i \ne j$ (Exercise 18.7). The first, second, 
and fifth assumptions imply that $E(u_i^2 \mid \mathbf{X}) = E(u_i^2 \mid \mathbf{X}_i) = \sigma_u^2$. Combining these results, 
we have that 

$$\text{under assumptions 1 and 2, } E(\mathbf{U} \mid \mathbf{X}) = \mathbf{0}_n, \text{ and} \tag{19.6}$$

$$\text{under assumptions 1, 2, and 5, } E(\mathbf{U}\mathbf{U}' \mid \mathbf{X}) = \sigma_u^2 \mathbf{I}_n, \tag{19.7}$$

where $\mathbf{0}_n$ is the n-dimensional vector of zeros and $\mathbf{I}_n$ is the $n \times n$ identity matrix. 

Similarly, the first, second, fifth, and sixth assumptions in Key Concept 19.1 imply 
that the conditional distribution of the n-dimensional random vector $\mathbf{U}$, conditional 
on $\mathbf{X}$, is the multivariate normal distribution (defined in Appendix 19.2). That is, 

$$\text{under assumptions 1, 2, 5, and 6, the conditional distribution of } \mathbf{U} \text{ given } \mathbf{X} \text{ is } N(\mathbf{0}_n, \sigma_u^2 \mathbf{I}_n). \tag{19.8}$$


The OLS Estimator 


The OLS estimator minimizes the sum of squared prediction mistakes, 
$\sum_{i=1}^n (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki})^2$ [Equation (6.8)]. The formula for the OLS 
estimator is obtained by taking the derivative of the sum of squared prediction 
mistakes with respect to each element of the coefficient vector, setting these 
derivatives to 0, and solving for the estimator $\hat{\boldsymbol{\beta}}$. 

The derivative of the sum of squared prediction mistakes with respect to the $j$th 
regression coefficient, $b_j$, is 

$$\frac{\partial}{\partial b_j}\sum_{i=1}^n (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki})^2 = -2\sum_{i=1}^n X_{ji}(Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki}) \tag{19.9}$$

for $j = 0, \ldots, k$, where, for $j = 0$, $X_{0i} = 1$ for all i. The derivative on the right-hand 
side of Equation (19.9) is the $j$th element of the k + 1 dimensional vector, 
$-2\mathbf{X}'(\mathbf{Y} - \mathbf{X}\mathbf{b})$, where $\mathbf{b}$ is the k + 1 dimensional vector consisting of $b_0, \ldots, b_k$. 
There are k + 1 such derivatives, each corresponding to an element of $\mathbf{b}$. Combined, 
these yield the system of k + 1 equations that, when set to 0, constitute the first-order 
conditions for the OLS estimator $\hat{\boldsymbol{\beta}}$. That is, $\hat{\boldsymbol{\beta}}$ solves the system of k + 1 equations: 

$$\mathbf{X}'(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}_{k+1} \tag{19.10}$$

or, equivalently, $\mathbf{X}'\mathbf{Y} = \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}$. 
Solving the system of equations (19.10) yields the OLS estimator $\hat{\boldsymbol{\beta}}$ in matrix form: 

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}, \tag{19.11}$$

where $(\mathbf{X}'\mathbf{X})^{-1}$ is the inverse of the matrix $\mathbf{X}'\mathbf{X}$. 


The role of “no perfect multicollinearity.” The fourth least squares assumption in 
Key Concept 19.1 states that $\mathbf{X}$ has full column rank. In turn, this implies that the 
matrix $\mathbf{X}'\mathbf{X}$ has full rank; that is, $\mathbf{X}'\mathbf{X}$ is nonsingular. Because $\mathbf{X}'\mathbf{X}$ is nonsingular, 
it is invertible. Thus the assumption that there is no perfect multicollinearity 
ensures that $(\mathbf{X}'\mathbf{X})^{-1}$ exists, so Equation (19.10) has a unique solution and the 
formula in Equation (19.11) for the OLS estimator can actually be computed. Said 
differently, if $\mathbf{X}$ does not have full column rank, there is not a unique solution to 
Equation (19.10), and $\mathbf{X}'\mathbf{X}$ is singular. Therefore, $(\mathbf{X}'\mathbf{X})^{-1}$ cannot be computed, and 
thus $\hat{\boldsymbol{\beta}}$ cannot be computed from Equation (19.11). 
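A minimal numerical sketch of Equations (19.10) and (19.11) follows. The function name `ols`, the simulated data, and the seed are assumptions made for illustration. The code solves the normal equations $\mathbf{X}'\mathbf{X}\mathbf{b} = \mathbf{X}'\mathbf{Y}$ with a linear solver instead of forming the inverse explicitly, which gives the same estimator when $\mathbf{X}$ has full column rank and is generally better behaved numerically.

```python
import numpy as np

def ols(X, Y):
    """OLS estimator: the solution of the normal equations X'X b = X'Y."""
    if np.linalg.matrix_rank(X) < X.shape[1]:
        raise ValueError("X does not have full column rank (perfect multicollinearity)")
    return np.linalg.solve(X.T @ X, X.T @ Y)

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

beta_hat = ols(X, Y)
resid = Y - X @ beta_hat
print(beta_hat)
print(X.T @ resid)    # first-order conditions (19.10): zero up to rounding error
```

Passing a matrix with a redundant column raises the error, the computational counterpart of the statement that the estimator cannot be computed from Equation (19.11) under perfect multicollinearity.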


19.2 Asymptotic Distribution of the OLS 
Estimator and t-Statistic 


If the sample size is large and the first four assumptions of Key Concept 19.1 are 
satisfied, then the OLS estimator has an asymptotic joint normal distribution, the 
heteroskedasticity-robust estimator of the covariance matrix is consistent, and the 
heteroskedasticity-robust OLS t-statistic has an asymptotic standard normal 
distribution. These results make use of the multivariate normal distribution (Appendix 19.2) 
and a multivariate extension of the central limit theorem. 


KEY CONCEPT 19.2 
The Multivariate Central Limit Theorem 

Suppose that $\mathbf{W}_1, \ldots, \mathbf{W}_n$ are i.i.d. m-dimensional random variables with mean 
vector $E(\mathbf{W}_i) = \boldsymbol{\mu}_W$ and covariance matrix $E[(\mathbf{W}_i - \boldsymbol{\mu}_W)(\mathbf{W}_i - \boldsymbol{\mu}_W)'] = \boldsymbol{\Sigma}_W$, 
where $\boldsymbol{\Sigma}_W$ is positive definite and finite. Let $\bar{\mathbf{W}} = \frac{1}{n}\sum_{i=1}^n \mathbf{W}_i$. Then 
$\sqrt{n}(\bar{\mathbf{W}} - \boldsymbol{\mu}_W) \xrightarrow{d} N(\mathbf{0}_m, \boldsymbol{\Sigma}_W)$. 


The Multivariate Central Limit Theorem 


The central limit theorem of Key Concept 2.7 applies to a one-dimensional random 
variable. To derive the joint asymptotic distribution of the elements of $\hat{\boldsymbol{\beta}}$, we need a 
multivariate central limit theorem that applies to vector-valued random variables. 

The multivariate central limit theorem extends the univariate central limit theorem 
to averages of observations on a vector-valued random variable, $\mathbf{W}$, where $\mathbf{W}$ is 
m-dimensional. The difference between the central limit theorems for a scalar-valued 
random variable and that for a vector-valued random variable is the conditions on the 
variances. In the scalar case in Key Concept 2.7, the requirement is that the variance is 
both nonzero and finite. In the vector case, the requirement is that the covariance 
matrix is both positive definite and finite. If the vector-valued random variable $\mathbf{W}$ has a 
finite positive definite covariance matrix, then $0 < \mathrm{var}(\mathbf{c}'\mathbf{W}) < \infty$ for all nonzero 
m-dimensional vectors $\mathbf{c}$ (Exercise 19.3). 

The multivariate central limit theorem that we will use is stated in Key Concept 19.2. 
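A small simulation can illustrate the statement of the multivariate central limit theorem. Everything specific in the sketch is an assumption chosen for illustration: the vector $\mathbf{W}_i = (E_i, E_i^2)'$ with $E_i$ drawn from an exponential distribution with mean 1, the sample size, the number of replications, and the seed. For this choice the mean vector is (1, 2)' and the covariance matrix has entries 1, 4, 4, and 20, which follow from the moments $E(E_i^r) = r!$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 2_000

# W_i = (E_i, E_i^2) with E_i ~ Exponential(1): a non-normal two-dimensional vector.
E = rng.exponential(1.0, size=(reps, n))
W_bar = np.stack([E.mean(axis=1), (E**2).mean(axis=1)], axis=1)   # reps x 2 sample means

mu_W = np.array([1.0, 2.0])                      # E(E_i) = 1, E(E_i^2) = 2
Sigma_W = np.array([[1.0, 4.0], [4.0, 20.0]])    # covariance matrix of W_i (positive definite)

scaled = np.sqrt(n) * (W_bar - mu_W)             # one draw of sqrt(n)(W_bar - mu_W) per replication
sample_cov = np.cov(scaled, rowvar=False)
print(sample_cov)                                # should be near Sigma_W
```

A histogram of either column of `scaled`, or of any linear combination of the two, should look approximately normal even though the underlying draws are strongly skewed.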


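Asymptotic Normality of $\hat{\boldsymbol{\beta}}$ 


In large samples, the OLS estimator has the multivariate normal asymptotic 
distribution 

$$\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{d} N(\mathbf{0}_{k+1}, \boldsymbol{\Sigma}_{\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})}), \text{ where } \boldsymbol{\Sigma}_{\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})} = \mathbf{Q}_X^{-1}\boldsymbol{\Sigma}_V\mathbf{Q}_X^{-1}, \tag{19.12}$$

where $\mathbf{Q}_X$ is the $(k + 1) \times (k + 1)$ dimensional matrix of second moments of the 
regressors, that is, $\mathbf{Q}_X = E(\mathbf{X}_i\mathbf{X}_i')$, and $\boldsymbol{\Sigma}_V$ is the $(k + 1) \times (k + 1)$ dimensional 
covariance matrix of $\mathbf{V}_i = \mathbf{X}_i u_i$, that is, $\boldsymbol{\Sigma}_V = E(\mathbf{V}_i\mathbf{V}_i')$. Note that the second least 
squares assumption in Key Concept 19.1 implies that $\mathbf{V}_i$, $i = 1, \ldots, n$, are i.i.d. 

Written in terms of $\hat{\boldsymbol{\beta}}$ rather than $\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$, the normal approximation in 
Equation (19.12) is 

$$\hat{\boldsymbol{\beta}}, \text{ in large samples, is approximately distributed } N(\boldsymbol{\beta}, \boldsymbol{\Sigma}_{\hat{\boldsymbol{\beta}}}), \text{ where } \boldsymbol{\Sigma}_{\hat{\boldsymbol{\beta}}} = \boldsymbol{\Sigma}_{\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})}/n = \mathbf{Q}_X^{-1}\boldsymbol{\Sigma}_V\mathbf{Q}_X^{-1}/n. \tag{19.13}$$

The covariance matrix $\boldsymbol{\Sigma}_{\hat{\boldsymbol{\beta}}}$ in Equation (19.13) is the covariance matrix of the 
approximate normal distribution of $\hat{\boldsymbol{\beta}}$, whereas $\boldsymbol{\Sigma}_{\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})}$ in Equation (19.12) is the covariance 
matrix of the asymptotic normal distribution of $\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$. These two covariance 
matrices differ by a factor of n, depending on whether the OLS estimator is scaled by $\sqrt{n}$. 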


Derivation of Equation (19.12). To derive Equation (19.12), first use Equations
(19.3) and (19.11) to write $\hat{\beta} = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + U)$, so that


$\hat{\beta} = \beta + (X'X)^{-1}X'U$. (19.14)


Thus $\hat{\beta} - \beta = (X'X)^{-1}X'U$, so


$\sqrt{n}(\hat{\beta} - \beta) = \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'U}{\sqrt{n}}\right)$. (19.15)


The derivation of Equation (19.12) involves arguing first that the "denominator"
matrix in Equation (19.15), $X'X/n$, is consistent for $Q_X$ and second that the "numerator"
matrix, $X'U/\sqrt{n}$, obeys the multivariate central limit theorem in Key Concept 19.2. The
details are given in Appendix 19.3.


Heteroskedasticity-Robust Standard Errors


The heteroskedasticity-robust estimator of $\Sigma_{\sqrt{n}(\hat{\beta}-\beta)}$ is obtained by replacing the
population moments in its definition [Equation (19.12)] by sample moments. Accordingly,
the heteroskedasticity-robust estimator of the covariance matrix of $\sqrt{n}(\hat{\beta} - \beta)$ is


$\hat{\Sigma}_{\sqrt{n}(\hat{\beta}-\beta)} = \left(\frac{X'X}{n}\right)^{-1} \hat{\Sigma}_{\hat{V}} \left(\frac{X'X}{n}\right)^{-1}$, where $\hat{\Sigma}_{\hat{V}} = \frac{1}{n-k-1}\sum_{i=1}^{n} X_i X_i' \hat{u}_i^2$. (19.16)


The estimator $\hat{\Sigma}_{\hat{V}}$ incorporates the same degrees-of-freedom adjustment that is in the
standard error of the regression (SER) for the multiple regression model (Section 6.4) to
adjust for potential downward bias because of estimation of $k + 1$ regression coefficients.


The proof that $\hat{\Sigma}_{\sqrt{n}(\hat{\beta}-\beta)} \xrightarrow{p} \Sigma_{\sqrt{n}(\hat{\beta}-\beta)}$ is conceptually similar to the proof,
presented in Section 18.3, of the consistency of heteroskedasticity-robust standard
errors for the single-regressor model.


Heteroskedasticity-robust standard errors. The heteroskedasticity-robust estimator
of the covariance matrix of $\hat{\beta}$, $\hat{\Sigma}_{\hat{\beta}}$, is


$\hat{\Sigma}_{\hat{\beta}} = n^{-1}\hat{\Sigma}_{\sqrt{n}(\hat{\beta}-\beta)}$. (19.17)


The heteroskedasticity-robust standard error for the $j$th regression coefficient is
the square root of the $j$th diagonal element of $\hat{\Sigma}_{\hat{\beta}}$. That is, the
heteroskedasticity-robust standard error of the $j$th coefficient is


$SE(\hat{\beta}_j) = \sqrt{(\hat{\Sigma}_{\hat{\beta}})_{jj}}$, (19.18)


where $(\hat{\Sigma}_{\hat{\beta}})_{jj}$ is the $(j, j)$ element of $\hat{\Sigma}_{\hat{\beta}}$.

Other heteroskedasticity-robust variance estimators. The variance estimator in
Equation (19.16) is called the HC1 variance estimator. The HC1 estimator is the
most commonly used in practice, but it is not the only heteroskedasticity-robust
variance estimator. Simulation studies have found that, in small samples, the HC1
estimator can be biased down, yielding standard errors that are too small. Long and
Ervin (2000) provide simulation evidence that in small samples HC1 can be
improved upon by a variant that weights each squared residual by a function of the
X's. Imbens and Kolesár (2016) point out that, in addition to this bias, in small
samples the sampling variability of the variance estimator makes the normal
approximation a poor one, and they suggest using instead a t approximation to the
t-statistic, along with a different variance estimator than HC1 or that suggested by
Long and Ervin (2000). Angrist and Pischke (2009) suggest, however, that when
the sample size exceeds 50, the HC1 estimator leads to negligible size distortions.
Consistent with modern econometric practice, this text focuses on large samples,
for which the HC1 estimator works well.
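
As an illustration of how the HC1 estimator in Equations (19.16) through (19.18) might be computed, the following minimal sketch uses NumPy on simulated data. The data-generating process, the seed, and all variable names are assumptions made for this example only; they do not come from the text.

```python
import numpy as np

# Simulated data (illustrative assumption): intercept plus k = 2 regressors,
# with an error variance that depends on the first regressor.
rng = np.random.default_rng(0)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x (k+1)
u = rng.normal(size=n) * (1.0 + np.abs(X[:, 1]))            # heteroskedastic errors
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + u

# OLS estimator and residuals
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
u_hat = Y - X @ beta_hat

# Sigma_V_hat = (1/(n-k-1)) * sum_i X_i X_i' u_hat_i^2
Sigma_V_hat = (X * u_hat[:, None] ** 2).T @ X / (n - k - 1)

# Equation (19.16): covariance estimator for sqrt(n) * (beta_hat - beta)
Q_inv = np.linalg.inv(X.T @ X / n)
Sigma_sqrtn = Q_inv @ Sigma_V_hat @ Q_inv

# Equation (19.17): covariance estimator for beta_hat itself
Sigma_beta_hat = Sigma_sqrtn / n

# Equation (19.18): robust standard errors are square roots of the diagonal
se_robust = np.sqrt(np.diag(Sigma_beta_hat))
```

Dividing by $n - k - 1$ rather than $n$ in `Sigma_V_hat` is the degrees-of-freedom adjustment that distinguishes HC1 from the unadjusted estimator.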


Confidence Intervals for Predicted Effects 


Section 8.1 describes two methods for computing the standard error of predicted 
effects that involve changes in two or more regressors. There are compact matrix 
expressions for these standard errors and thus for confidence intervals for predicted 
effects. 

Consider a change in the value of the regressors for the $i$th observation from
some initial value, say $X_{i,0}$, to some new value, $X_{i,0} + d$, so that the change
in $X_i$ is $\Delta X_i = d$, where $d$ is a $k + 1$ dimensional vector. This change in $X$ can
involve multiple regressors (that is, multiple elements of $X_i$). For example, if two
of the regressors are the value of an independent variable and its square, then $d$
is the difference between the subsequent and initial values of these two
variables.

The expected effect of this change in $X_i$ is $d'\beta$, and the estimator of this effect is
$d'\hat{\beta}$. Because linear combinations of normally distributed random variables
are themselves normally distributed, $\sqrt{n}(d'\hat{\beta} - d'\beta) = d'\sqrt{n}(\hat{\beta} - \beta) \xrightarrow{d}
N(0, d'\Sigma_{\sqrt{n}(\hat{\beta}-\beta)}d)$. Thus the standard error of this predicted effect is $(d'\hat{\Sigma}_{\hat{\beta}}d)^{1/2}$.
A 95% confidence interval for this predicted effect is


$d'\hat{\beta} \pm 1.96\sqrt{d'\hat{\Sigma}_{\hat{\beta}}d}$. (19.19)
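
The interval in Equation (19.19) can be sketched numerically for the quadratic example mentioned above, in which a regressor and its square both change. Everything in this sketch (the simulated data, the helper name `hc1_cov`, and the chosen values `x0` and `x1`) is a hypothetical illustration rather than part of the text.

```python
import numpy as np

def hc1_cov(X, Y):
    """Return (beta_hat, HC1 covariance estimate of beta_hat); helper name is hypothetical."""
    n, p = X.shape                      # p = k + 1 columns, including the intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    u_hat = Y - X @ beta_hat
    meat = (X * u_hat[:, None] ** 2).T @ X
    return beta_hat, n / (n - p) * XtX_inv @ meat @ XtX_inv

# Simulated quadratic regression (illustrative assumption)
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x, x**2])
Y = 1.0 + 2.0 * x - 0.3 * x**2 + rng.normal(size=n) * (0.5 + x)

beta_hat, Sigma_beta_hat = hc1_cov(X, Y)

# Change x from x0 to x1: the intercept does not change, x changes by x1 - x0,
# and x^2 changes by x1^2 - x0^2.
x0, x1 = 1.0, 2.0
d = np.array([0.0, x1 - x0, x1**2 - x0**2])

effect = d @ beta_hat                          # estimator d' beta_hat
se_effect = np.sqrt(d @ Sigma_beta_hat @ d)    # (d' Sigma_hat d)^(1/2)
ci = (effect - 1.96 * se_effect, effect + 1.96 * se_effect)   # Equation (19.19)
```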


Asymptotic Distribution of the t-Statistic 


The t-statistic testing the null hypothesis that $\beta_j = \beta_{j,0}$, constructed using the
heteroskedasticity-robust standard error in Equation (19.18), is given in Key Concept 7.1.
The argument that this t-statistic has an asymptotic standard normal distribution
parallels the argument given in Section 18.3 for the single-regressor model.


19.3 Tests of Joint Hypotheses


Section 7.2 considers tests of joint hypotheses that involve multiple restrictions, where
each restriction involves a single coefficient, and Section 7.3 considers tests of a single
restriction involving two or more coefficients. The matrix setup of Section 19.1 permits
a unified representation of these two types of hypotheses as linear restrictions on the
coefficient vector, where each restriction can involve multiple coefficients. Under the first
four least squares assumptions in Key Concept 19.1, the heteroskedasticity-robust OLS
F-statistic testing these hypotheses has an $F_{q,\infty}$ asymptotic distribution under the null
hypothesis.


Joint Hypotheses in Matrix Notation 


Consider a joint hypothesis that is linear in the coefficients and imposes $q$ restrictions,
where $q \le k + 1$. Each of these $q$ restrictions can involve one or more of the regression
coefficients. This joint null hypothesis can be written in matrix notation as


$R\beta = r$, (19.20)


where $R$ is a $q \times (k + 1)$ nonrandom matrix with full row rank and $r$ is a nonrandom
$q \times 1$ vector. The number of rows of $R$ is $q$, which is the number of restrictions being
imposed under the null hypothesis.

The null hypothesis in Equation (19.20) subsumes all the null hypotheses considered
in Sections 7.2 and 7.3. For example, a joint hypothesis of the type considered
in Section 7.2 is that $\beta_0 = 0, \beta_1 = 0, \ldots, \beta_{q-1} = 0$. To write this joint hypothesis in
the form of Equation (19.20), set $R = [I_q \;\; 0_{q \times (k+1-q)}]$ and $r = 0_q$.

The formulation in Equation (19.20) also captures the restrictions of Section 7.3
involving multiple regression coefficients. For example, if $k = 2$, then the hypothesis that
$\beta_1 + \beta_2 = 1$ can be written in the form of Equation (19.20) by setting $R = [0 \;\; 1 \;\; 1]$,
$r = 1$, and $q = 1$.


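Asymptotic Distribution of the F-Statistic


The heteroskedasticity-robust F-statistic testing the joint hypothesis in Equation
(19.20) is


$F = (R\hat{\beta} - r)'[R\hat{\Sigma}_{\hat{\beta}}R']^{-1}(R\hat{\beta} - r)/q$. (19.21)


If the first four assumptions in Key Concept 19.1 hold, then under the null
hypothesis


$F \xrightarrow{d} F_{q,\infty}$. (19.22)


This result follows by combining the asymptotic normality of $\hat{\beta}$ with the consistency
of the heteroskedasticity-robust estimator $\hat{\Sigma}_{\sqrt{n}(\hat{\beta}-\beta)}$ of the covariance
matrix. Specifically, first note that Equation (19.12) and Equation (19.74) in
Appendix 19.2 imply that, under the null hypothesis, $\sqrt{n}(R\hat{\beta} - r) =
\sqrt{n}R(\hat{\beta} - \beta) \xrightarrow{d} N(0, R\Sigma_{\sqrt{n}(\hat{\beta}-\beta)}R')$. It follows from Equation (19.77) that,
under the null hypothesis, $(R\hat{\beta} - r)'[R\Sigma_{\hat{\beta}}R']^{-1}(R\hat{\beta} - r) = [\sqrt{n}R(\hat{\beta} - \beta)]'
[R\Sigma_{\sqrt{n}(\hat{\beta}-\beta)}R']^{-1}[\sqrt{n}R(\hat{\beta} - \beta)] \xrightarrow{d} \chi^2_q$. However, because
$\hat{\Sigma}_{\sqrt{n}(\hat{\beta}-\beta)} \xrightarrow{p} \Sigma_{\sqrt{n}(\hat{\beta}-\beta)}$, it follows from Slutsky's theorem that
$[\sqrt{n}R(\hat{\beta} - \beta)]'[R\hat{\Sigma}_{\sqrt{n}(\hat{\beta}-\beta)}R']^{-1}[\sqrt{n}R(\hat{\beta} - \beta)] \xrightarrow{d} \chi^2_q$ or, equivalently
(because $\hat{\Sigma}_{\hat{\beta}} = \hat{\Sigma}_{\sqrt{n}(\hat{\beta}-\beta)}/n$), that $F \xrightarrow{d} \chi^2_q/q$, which is in turn distributed $F_{q,\infty}$.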
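
To make the construction of $R$, $r$, and the statistic in Equation (19.21) concrete, here is a minimal NumPy sketch for a joint null hypothesis with $q = 2$ exclusion restrictions. The simulated data and variable names are assumptions for illustration. The critical value uses the closed form of the chi-squared quantile that is available only when $q = 2$.

```python
import numpy as np

# Simulated data (illustrative assumption): k = 3 regressors plus an intercept.
rng = np.random.default_rng(2)
n, k = 400, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 0.0, 0.0, 0.5])          # beta_1 = beta_2 = 0 in this population
u = rng.normal(size=n) * np.exp(0.3 * X[:, 3])  # heteroskedastic errors
Y = X @ beta + u

# OLS and the HC1 covariance estimator of beta_hat
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
u_hat = Y - X @ beta_hat
meat = (X * u_hat[:, None] ** 2).T @ X
Sigma_beta_hat = n / (n - k - 1) * XtX_inv @ meat @ XtX_inv

# Null hypothesis R beta = r: beta_1 = 0 and beta_2 = 0 (q = 2 restrictions)
R = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
r = np.zeros(2)
q = R.shape[0]

# Equation (19.21): heteroskedasticity-robust F-statistic
diff = R @ beta_hat - r
F = diff @ np.linalg.solve(R @ Sigma_beta_hat @ R.T, diff) / q

# 5% critical value of F_{q,inf} = chi2_q / q. For q = 2 the chi-squared
# 95th percentile has the closed form -2 * ln(0.05).
crit = -2.0 * np.log(0.05) / q
reject = F > crit
```

With a single restriction ($q = 1$), the same formula reduces to the square of the heteroskedasticity-robust t-statistic.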


Confidence Sets for Multiple Coefficients 


As discussed in Section 7.4, an asymptotically valid confidence set for two or more
elements of $\beta$ can be constructed as the set of values that, when taken as the null
hypothesis, are not rejected by the F-statistic. In principle, this set could be computed
by repeatedly evaluating the F-statistic for many values of $\beta$, but, as is the case with
a confidence interval for a single coefficient, it is simpler to manipulate the formula
for the test statistic to obtain an explicit formula for the confidence set.

Here is the procedure for constructing a confidence set for two or more of the
elements of $\beta$. Let $\delta$ denote the $q$-dimensional vector consisting of the coefficients
for which we wish to construct a confidence set. For example, if we are constructing
a confidence set for the regression coefficients $\beta_1$ and $\beta_2$, then $q = 2$ and $\delta = (\beta_1 \;\; \beta_2)'$.
In general, we can write $\delta = R\beta$, where the matrix $R$ consists of 0's and 1's [as
discussed following Equation (19.20)]. The F-statistic testing the hypothesis that $\delta = \delta_0$
is $F = (\hat{\delta} - \delta_0)'[R\hat{\Sigma}_{\hat{\beta}}R']^{-1}(\hat{\delta} - \delta_0)/q$, where $\hat{\delta} = R\hat{\beta}$. A 95% confidence set for $\delta$
is the set of values $\delta_0$ that are not rejected by the F-statistic. That is, when $\delta = R\beta$,
a 95% confidence set for $\delta$ is


$\{\delta: (\hat{\delta} - \delta)'[R\hat{\Sigma}_{\hat{\beta}}R']^{-1}(\hat{\delta} - \delta)/q \le c\}$, (19.23)


where $c$ is the 95th percentile (the 5% critical value) of the $F_{q,\infty}$ distribution.

The set in Equation (19.23) consists of all the points contained inside the ellipse
determined when the inequality in Equation (19.23) is an equality (this is an ellipsoid
when $q > 2$). Thus the confidence set for $\delta$ can be computed by solving Equation
(19.23) for the boundary ellipse.
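
One possible way to work with the set in Equation (19.23) numerically is sketched below for $q = 2$. The point estimate `delta_hat` and the matrix `V` (standing in for $R\hat{\Sigma}_{\hat{\beta}}R'$) are made-up numbers chosen only for illustration, and the Cholesky-based parameterization of the boundary is one convenient choice among several.

```python
import numpy as np

# Hypothetical inputs (illustrative numbers, not taken from the text):
delta_hat = np.array([0.8, -0.2])          # estimate of delta = R beta
V = np.array([[0.04, 0.01],
              [0.01, 0.09]])               # stands in for R Sigma_hat R'
q = delta_hat.size

# 95th percentile of F_{q,inf}; the closed form below is valid for q = 2 only.
c = -2.0 * np.log(0.05) / q

def in_confidence_set(delta0):
    """True if delta0 is not rejected, i.e. satisfies the inequality in (19.23)."""
    diff = delta_hat - np.asarray(delta0, dtype=float)
    return diff @ np.linalg.solve(V, diff) / q <= c

# Boundary ellipse: delta_hat + sqrt(q * c) * L z, where V = L L' (Cholesky factor)
# and z runs over the unit circle.
L = np.linalg.cholesky(V)
theta = np.linspace(0.0, 2.0 * np.pi, 200)
circle = np.vstack([np.cos(theta), np.sin(theta)])
boundary = delta_hat[:, None] + np.sqrt(q * c) * (L @ circle)   # 2 x 200 array of points
```

Each column of `boundary` satisfies Equation (19.23) with equality, so plotting the columns traces the edge of the confidence set.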


19.4 Distribution of Regression Statistics with Normal Errors


The distributions presented in Sections 19.2 and 19.3, which were justified by appealing
to the law of large numbers and the central limit theorem, apply when the sample
size is large. If, however, the errors are homoskedastic and normally distributed,
conditional on $X$, then the OLS estimator has a multivariate normal distribution in a
finite sample, conditional on $X$. In addition, the finite sample distribution of the
square of the standard error of the regression is proportional to the chi-squared
distribution with $n - k - 1$ degrees of freedom, the homoskedasticity-only OLS
t-statistic has a Student t distribution with $n - k - 1$ degrees of freedom, and the
homoskedasticity-only F-statistic has an $F_{q,n-k-1}$ distribution. The arguments in this
section employ some specialized matrix formulas for OLS regression statistics, which
are presented first.


Matrix Representations of OLS Regression Statistics 


The OLS predicted values, residuals, and sum of squared residuals have compact matrix
representations. These representations make use of two matrices, $P_X$ and $M_X$.


The matrices $P_X$ and $M_X$. The algebra of OLS in the multivariate model relies on the
two symmetric $n \times n$ matrices, $P_X$ and $M_X$:


$P_X = X(X'X)^{-1}X'$ and (19.24)
$M_X = I_n - P_X$. (19.25)


A matrix $C$ is idempotent if $C$ is square and $CC = C$ (see Appendix 19.1). Because
$P_X = P_XP_X$ and $M_X = M_XM_X$ (Exercise 19.5) and because $P_X$ and $M_X$ are symmetric,
$P_X$ and $M_X$ are symmetric idempotent matrices.

The matrices $P_X$ and $M_X$ have some additional useful properties (Exercise 19.5),
which follow directly from the definitions in Equations (19.24) and (19.25):


$P_XX = X$ and $M_XX = 0_{n \times (k+1)}$;
$\mathrm{rank}(P_X) = k + 1$ and $\mathrm{rank}(M_X) = n - k - 1$, (19.26)


where $\mathrm{rank}(P_X)$ is the rank of $P_X$.

The matrices $P_X$ and $M_X$ can be used to decompose an $n$-dimensional vector $Z$
into two parts: a part that is spanned by the columns of $X$ and a part that is orthogonal
to the columns of $X$. In other words, $P_XZ$ is the projection of $Z$ onto the space
spanned by the columns of $X$, $M_XZ$ is the part of $Z$ orthogonal to the columns of $X$,
and $Z = P_XZ + M_XZ$.


OLS predicted values and residuals. The matrices $P_X$ and $M_X$ provide some simple
expressions for OLS predicted values and residuals. The OLS predicted values, $\hat{Y} = X\hat{\beta}$,
and the OLS residuals, $\hat{U} = Y - \hat{Y}$, can be expressed as follows (Exercise 19.5):


$\hat{Y} = P_XY$ and (19.27)
$\hat{U} = M_XY = M_XU$. (19.28)


The expressions in Equations (19.27) and (19.28) provide a simple proof that the
OLS residuals and predicted values are orthogonal, that is, that Equation (4.35) holds:
$\hat{Y}'\hat{U} = Y'P_XM_XY = 0$, where the second equality follows from $P_XM_X = 0_{n \times n}$, which
in turn follows from $M_XX = 0_{n \times (k+1)}$ in Equation (19.26).

The standard error of the regression. The SER, defined in Section 4.3, is $s_{\hat{u}}$, where


$s_{\hat{u}}^2 = \frac{1}{n-k-1}\sum_{i=1}^{n}\hat{u}_i^2 = \frac{1}{n-k-1}\hat{U}'\hat{U} = \frac{1}{n-k-1}U'M_XU$, (19.29)


where the final equality follows because $\hat{U}'\hat{U} = (M_XU)'(M_XU) = U'M_XM_XU =
U'M_XU$ (because $M_X$ is symmetric and idempotent).
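
The properties of the projection matrices and the identities in Equations (19.26) through (19.29) can be checked numerically. The sketch below does so with NumPy on a small simulated data set; the dimensions, seed, and coefficient values are arbitrary assumptions, and forming the $n \times n$ matrices explicitly is done only for illustration (it would be wasteful for large $n$).

```python
import numpy as np

# Small simulated example (illustrative assumption)
rng = np.random.default_rng(3)
n, k = 12, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, -1.0, 0.5])
U = rng.normal(size=n)
Y = X @ beta + U

# Equations (19.24) and (19.25)
P = X @ np.linalg.solve(X.T @ X, X.T)    # P_X = X (X'X)^{-1} X'
M = np.eye(n) - P                         # M_X = I_n - P_X

# Ranks in Equation (19.26); an explicit tolerance guards against rounding error
rank_P = np.linalg.matrix_rank(P, tol=1e-8)
rank_M = np.linalg.matrix_rank(M, tol=1e-8)

# OLS predicted values and residuals computed the usual way
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
U_hat = Y - Y_hat

# Equation (19.29): the sum of squared residuals two ways
ssr_from_residuals = U_hat @ U_hat
ssr_from_errors = U @ M @ U
s2_u_hat = ssr_from_residuals / (n - k - 1)
```

Comparing `P @ Y` with `Y_hat` and `M @ Y` with `U_hat` reproduces Equations (19.27) and (19.28) up to floating-point rounding.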


Distribution of $\hat{\beta}$ with Independent Normal Errors


Because $\hat{\beta} = \beta + (X'X)^{-1}X'U$ [Equation (19.14)] and because the distribution
of $U$, conditional on $X$, is, by assumption, $N(0_n, \sigma_u^2I_n)$ [Equation (19.8)], the conditional
distribution of $\hat{\beta}$ given $X$ is multivariate normal with mean $\beta$. The covariance
matrix of $\hat{\beta}$, conditional on $X$, is $\Sigma_{\hat{\beta}|X} = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'|X] =
E[(X'X)^{-1}X'UU'X(X'X)^{-1}|X] = (X'X)^{-1}X'(\sigma_u^2I_n)X(X'X)^{-1} = \sigma_u^2(X'X)^{-1}$. Accordingly,
under all six assumptions in Key Concept 19.1, the finite-sample conditional distribution
of $\hat{\beta}$ given $X$ is


$\hat{\beta} \sim N(\beta, \Sigma_{\hat{\beta}|X})$, where $\Sigma_{\hat{\beta}|X} = \sigma_u^2(X'X)^{-1}$. (19.30)


Distribution of $s_{\hat{u}}^2$


If all six assumptions in Key Concept 19.1 hold, then $s_{\hat{u}}^2$ has an exact sampling
distribution that is proportional to a chi-squared distribution with $n - k - 1$ degrees of
freedom:


$s_{\hat{u}}^2 \sim \frac{\sigma_u^2}{n-k-1} \times \chi^2_{n-k-1}$. (19.31)


The proof of Equation (19.31) starts with Equation (19.29). Because $U$ is normally
distributed, conditional on $X$, and because $M_X$ is a symmetric idempotent matrix, the
quadratic form $U'M_XU/\sigma_u^2$ has an exact chi-squared distribution with degrees of
freedom equal to the rank of $M_X$ [Equation (19.78) in Appendix 19.2]. From Equation
(19.26), the rank of $M_X$ is $n - k - 1$. Thus $U'M_XU/\sigma_u^2$ has an exact $\chi^2_{n-k-1}$
distribution, from which Equation (19.31) follows.

The degrees-of-freedom adjustment ensures that $s_{\hat{u}}^2$ is unbiased. The expectation
of a random variable with a $\chi^2_{n-k-1}$ distribution is $n - k - 1$; thus
$E(U'M_XU) = (n - k - 1)\sigma_u^2$, so $E(s_{\hat{u}}^2) = \sigma_u^2$.
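
A small Monte Carlo experiment can illustrate Equation (19.31) and the unbiasedness of $s_{\hat{u}}^2$. The sketch below holds $X$ fixed, draws many normal error vectors, and compares simulated moments with those implied by the chi-squared distribution. The sample size, $\sigma_u$, number of replications, and seed are assumptions chosen for this example, and the simulated moments match the theoretical ones only approximately.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma_u = 20, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # held fixed across draws
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)            # M_X

dof = n - k - 1
trace_M = np.trace(M)     # for an idempotent matrix the trace equals the rank, n - k - 1

reps = 20000
U = sigma_u * rng.normal(size=(reps, n))   # each row is one draw of U ~ N(0, sigma_u^2 I_n)
U_hat = U @ M                              # each row is M_X U (M_X is symmetric)
s2 = np.sum(U_hat**2, axis=1) / dof        # s_u_hat^2 in each replication

mean_s2 = s2.mean()                            # should be near sigma_u^2 = 2.25
var_scaled = (dof * s2 / sigma_u**2).var()     # should be near 2 * dof, the chi-squared variance
```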


Homoskedasticity-Only Standard Errors 


The homoskedasticity-only estimator $\tilde{\Sigma}_{\hat{\beta}}$ of the covariance matrix of $\hat{\beta}$, conditional
on $X$, is obtained by substituting the sample variance $s_{\hat{u}}^2$ for the population variance
$\sigma_u^2$ in the expression for $\Sigma_{\hat{\beta}|X}$ in Equation (19.30). Accordingly,


$\tilde{\Sigma}_{\hat{\beta}} = s_{\hat{u}}^2(X'X)^{-1}$ (homoskedasticity-only). (19.32)

The estimator of the variance of the normal conditional distribution of $\hat{\beta}_j$ given $X$ is
the $(j, j)$ element of $\tilde{\Sigma}_{\hat{\beta}}$. Thus the homoskedasticity-only standard error of $\hat{\beta}_j$ is the
square root of the $j$th diagonal element of $\tilde{\Sigma}_{\hat{\beta}}$. That is, the homoskedasticity-only
standard error of $\hat{\beta}_j$ is


$SE(\hat{\beta}_j) = \sqrt{(\tilde{\Sigma}_{\hat{\beta}})_{jj}}$ (homoskedasticity-only). (19.33)


Distribution of the t-Statistic


Let $\tilde{t}$ be the t-statistic testing the hypothesis $\beta_j = \beta_{j,0}$, constructed using the
homoskedasticity-only standard error; that is, let


$\tilde{t} = \frac{\hat{\beta}_j - \beta_{j,0}}{\sqrt{(\tilde{\Sigma}_{\hat{\beta}})_{jj}}}$. (19.34)


Under all six of the extended least squares assumptions in Key Concept 19.1, the
exact sampling distribution of $\tilde{t}$ is the Student t distribution with $n - k - 1$ degrees
of freedom; that is,


$\tilde{t} \sim t_{n-k-1}$. (19.35)


The proof of Equation (19.35) is given in Appendix 19.4.
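
The homoskedasticity-only quantities in Equations (19.32) through (19.34) are straightforward to compute. The following sketch, on simulated data with homoskedastic normal errors, is meant only as an illustration; the data, the tested coefficient index `j`, and the hypothesized value `beta_j0` are assumptions.

```python
import numpy as np

# Simulated data with homoskedastic normal errors (illustrative assumption)
rng = np.random.default_rng(5)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 0.5, 0.0])
Y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
u_hat = Y - X @ beta_hat

s2 = u_hat @ u_hat / (n - k - 1)             # Equation (19.29)
Sigma_tilde = s2 * XtX_inv                   # Equation (19.32)
se_homosk = np.sqrt(np.diag(Sigma_tilde))    # Equation (19.33)

# t-statistic for H0: beta_j = beta_j0, Equation (19.34)
j, beta_j0 = 2, 0.0
t_tilde = (beta_hat[j] - beta_j0) / se_homosk[j]

# Under the six assumptions, t_tilde is compared with Student t critical values
# with this many degrees of freedom:
dof = n - k - 1
```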


Distribution of the F-Statistic 


If all six least squares assumptions in Key Concept 19.1 hold, then the F-statistic testing
the hypothesis in Equation (19.20), constructed using the homoskedasticity-only
estimator of the covariance matrix, has an exact $F_{q,n-k-1}$ distribution under the null
hypothesis.


The homoskedasticity-only F-statistic. The homoskedasticity-only F-statistic is similar
to the heteroskedasticity-robust F-statistic in Equation (19.21) except that the
homoskedasticity-only estimator $\tilde{\Sigma}_{\hat{\beta}}$ is used instead of the heteroskedasticity-robust
estimator $\hat{\Sigma}_{\hat{\beta}}$. Substituting the expression $\tilde{\Sigma}_{\hat{\beta}} = s_{\hat{u}}^2(X'X)^{-1}$ into the expression for the
F-statistic in Equation (19.21) yields the homoskedasticity-only F-statistic testing the
null hypothesis in Equation (19.20):


$\tilde{F} = \frac{(R\hat{\beta} - r)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - r)/q}{s_{\hat{u}}^2}$. (19.36)


If all six assumptions in Key Concept 19.1 hold, then under the null hypothesis


$\tilde{F} \sim F_{q,n-k-1}$. (19.37)


The proof of Equation (19.37) is given in Appendix 19.4.

The F-statistic in Equation (19.36) is called the Wald version of the F-statistic
(named after the statistician Abraham Wald). Although the formula for the
homoskedasticity-only F-statistic given in Equation (7.13) appears quite different from
the formula for the Wald statistic in Equation (19.36), the homoskedasticity-only
F-statistic and the Wald F-statistic are two versions of the same statistic. That is, the
two expressions are equivalent, a result shown in Exercise 19.13.
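
The equivalence of the two versions can be checked numerically for exclusion restrictions. The sketch below computes the Wald form in Equation (19.36) and a form based on restricted and unrestricted sums of squared residuals, which is assumed here to be the form of Equation (7.13). The simulated data and the helper name `ols` are illustrative assumptions.

```python
import numpy as np

# Simulated data (illustrative assumption): k = 3 regressors plus an intercept
rng = np.random.default_rng(6)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

def ols(Xmat, y):
    """Return the OLS coefficients and the sum of squared residuals (hypothetical helper)."""
    b = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ y)
    resid = y - Xmat @ b
    return b, resid @ resid

# Null hypothesis: beta_2 = 0 and beta_3 = 0
R = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
r = np.zeros(2)
q = R.shape[0]

# Wald form, Equation (19.36)
beta_hat, ssr_unrestricted = ols(X, Y)
s2 = ssr_unrestricted / (n - k - 1)
diff = R @ beta_hat - r
middle = R @ np.linalg.inv(X.T @ X) @ R.T
F_wald = (diff @ np.linalg.solve(middle, diff) / q) / s2

# Sum-of-squared-residuals form: the restricted model drops the two excluded regressors
_, ssr_restricted = ols(X[:, :2], Y)
F_ssr = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - k - 1))
```

The two numbers agree up to floating-point rounding, which is the numerical counterpart of the algebraic result referred to above.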


Efficiency of the OLS Estimator with Homoskedastic Errors


Under the Gauss-Markov conditions for multiple regression, the OLS estimator of
$\beta$ is efficient among all linear conditionally unbiased estimators; that is, the OLS
estimator is the best linear unbiased estimator (BLUE).


The Gauss-Markov Conditions for Multiple Regression


The Gauss-Markov conditions for multiple regression are


(i) $E(U|X) = 0_n$,
(ii) $E(UU'|X) = \sigma_u^2I_n$, and
(iii) $X$ has full column rank. (19.38)


The Gauss-Markov conditions for multiple regression in turn are implied by the first
five assumptions in Key Concept 19.1 [see Equations (19.6) and (19.7)]. The conditions
in Equation (19.38) generalize the Gauss-Markov conditions for a single-regressor
model to multiple regression. [By using matrix notation, the second and
third Gauss-Markov conditions in Equation (5.31) are collected into the single
condition (ii) in Equation (19.38).]


Linear Conditionally Unbiased Estimators 


We start by describing the class of linear unbiased estimators and by showing that 
OLS is in that class. 


The class of linear conditionally unbiased estimators. An estimator of $\beta$ is said to
be linear if it is a linear function of $Y_1, \ldots, Y_n$. Accordingly, the estimator $\tilde{\beta}$ is linear
in $Y$ if it can be written in the form


$\tilde{\beta} = A'Y$, (19.39)


where $A$ is an $n \times (k + 1)$ dimensional matrix of weights that may depend on $X$ and
on nonrandom constants but not on $Y$.


Key Concept 19.3: Gauss-Markov Theorem for Multiple Regression

Suppose that the Gauss-Markov conditions for multiple regression in Equation
(19.38) hold. Then the OLS estimator $\hat{\beta}$ is BLUE. That is, let $\tilde{\beta}$ be a linear
conditionally unbiased estimator of $\beta$, and let $c$ be a nonrandom $k + 1$ dimensional
vector. Then $\mathrm{var}(c'\hat{\beta}|X) \le \mathrm{var}(c'\tilde{\beta}|X)$ for every nonzero vector $c$, where the
inequality holds with equality for all $c$ only if $\tilde{\beta} = \hat{\beta}$.


An estimator is conditionally unbiased if the mean of its conditional sampling
distribution given $X$ is $\beta$. That is, $\tilde{\beta}$ is conditionally unbiased if $E(\tilde{\beta}|X) = \beta$.


The OLS estimator is linear and conditionally unbiased. Comparison of Equations
(19.11) and (19.39) shows that the OLS estimator is linear in $Y$; specifically, $\hat{\beta} = \hat{A}'Y$,
where $\hat{A} = X(X'X)^{-1}$. To show that $\hat{\beta}$ is conditionally unbiased, recall from Equation
(19.14) that $\hat{\beta} = \beta + (X'X)^{-1}X'U$. Taking the conditional expectation of both
sides of this expression yields $E(\hat{\beta}|X) = \beta + E[(X'X)^{-1}X'U|X] = \beta + (X'X)^{-1}
X'E(U|X) = \beta$, where the final equality follows because $E(U|X) = 0$ by the first
Gauss-Markov condition.


The Gauss-Markov Theorem for Multiple Regression 


The Gauss-Markov theorem for multiple regression provides conditions under
which the OLS estimator is efficient among the class of linear conditionally unbiased
estimators. A subtle point arises, however, because $\hat{\beta}$ is a vector and its "variance" is
a covariance matrix. When the variance of an estimator is a matrix, just what does it
mean to say that one estimator has a smaller variance than another?

The Gauss-Markov theorem handles this problem by comparing the variance of a
candidate estimator of a linear combination of the elements of $\beta$ to the variance of the
corresponding linear combination of $\hat{\beta}$. Specifically, let $c$ be a $k + 1$ dimensional vector,
and consider the problem of estimating the linear combination $c'\beta$ using the candidate
estimator $c'\tilde{\beta}$ (where $\tilde{\beta}$ is a linear conditionally unbiased estimator) on the one hand and
$c'\hat{\beta}$ on the other hand. Because $c'\tilde{\beta}$ and $c'\hat{\beta}$ are both scalars and are both linear
conditionally unbiased estimators of $c'\beta$, it now makes sense to compare their variances.

The Gauss-Markov theorem for multiple regression says that the OLS estimator
of $c'\beta$ is efficient; that is, the OLS estimator $c'\hat{\beta}$ has the smallest conditional variance
of all linear conditionally unbiased estimators. Remarkably, this is true no matter
what the linear combination is. It is in this sense that the OLS estimator is BLUE in
multiple regression.
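
The comparison of variances described above can be illustrated without simulation, because under the Gauss-Markov conditions any linear estimator $A'Y$ has conditional covariance matrix $\sigma_u^2A'A$. The sketch below compares OLS with one particular competitor, a weighted least squares estimator with arbitrarily chosen weights; the competitor, the weights, and the vector `c` are hypothetical choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, sigma2_u = 40, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

# OLS weights: beta_hat = A_ols' Y with A_ols = X (X'X)^{-1}
A_ols = X @ np.linalg.inv(X.T @ X)

# Competitor (hypothetical): beta_tilde = (X'WX)^{-1} X'W Y with W = diag(w),
# which is linear in Y and satisfies A_alt' X = I, so it is conditionally unbiased.
w = rng.uniform(0.2, 5.0, size=n)
XW = X * w[:, None]                        # rows of X scaled by the weights, i.e. W X
A_alt = XW @ np.linalg.inv(X.T @ XW)

# Conditional covariance matrices when E(UU'|X) = sigma2_u * I_n
V_ols = sigma2_u * A_ols.T @ A_ols         # equals sigma2_u * (X'X)^{-1}
V_alt = sigma2_u * A_alt.T @ A_alt

# The theorem implies V_alt - V_ols is positive semidefinite,
# so its eigenvalues should be nonnegative up to rounding error.
diff_eigs = np.linalg.eigvalsh(V_alt - V_ols)

# Variances of one particular linear combination c' beta
c = np.array([1.0, -2.0, 0.5])
var_c_ols = c @ V_ols @ c
var_c_alt = c @ V_alt @ c
```

Because the difference of the covariance matrices is positive semidefinite, `var_c_ols` is no larger than `var_c_alt` for any choice of `c`, not only the one used here.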

The Gauss-Markov theorem is stated in Key Concept 19.3 and proven in
Appendix 19.5.


Generalized Least Squares¹


The assumption of i.i.d. sampling fits many applications. For example, suppose that
$Y_i$ and $X_i$ correspond to information about individuals, such as their earnings, education,
and personal characteristics, where the individuals are selected from a population
by simple random sampling. In this case, because of the simple random sampling
scheme, $(X_i, Y_i)$ are necessarily i.i.d. Because $(X_i, Y_i)$ and $(X_j, Y_j)$ are independently
distributed for $i \ne j$, $u_i$ and $u_j$ are independently distributed for $i \ne j$. This in turn
implies that $u_i$ and $u_j$ are uncorrelated for $i \ne j$. In the context of the Gauss-Markov
assumptions, the assumption that $E(UU'|X)$ is diagonal therefore is appropriate if the
data are collected in a way that makes the observations independently distributed.

Some sampling schemes encountered in econometrics do not, however, result in 
independent observations and instead can lead to error terms u; that are correlated 
from one observation to the next. The leading example is when the data are sampled 
over time for the same entity—that is, when the data are time series data. As dis- 
cussed in Section 16.3, in regressions involving time series data, many omitted fac- 
tors are correlated from one period to the next, and this can result in regression 
error terms (which represent those omitted factors) that are correlated from one 
period of observation to the next. In other words, the error term in one period will 
not, in general, be distributed independently of the error term in the next period. 
Instead, the error term in one period could be correlated with the error term in the 
next period. 

The presence of correlated error terms creates two problems for inference based 
on OLS. First, neither the heteroskedasticity-robust nor the homoskedasticity-only 
standard errors produced by OLS provide a valid basis for inference. The solution to 
this problem is to use standard errors that are robust to both heteroskedasticity and 
correlation of the error terms across observations. This topic—heteroskedasticity- 
and autocorrelation-consistent (HAC) covariance matrix estimation—is the subject 
of Section 16.4 and we do not pursue it further here. 

Second, if the error term is correlated across observations, then $E(UU'|X)$ is not
diagonal, the second Gauss-Markov condition in Equation (19.38) does not hold,
and OLS is not BLUE. In this section, we study an estimator, generalized least
squares (GLS), that is BLUE (at least asymptotically) when the conditional covariance
matrix of the errors is no longer proportional to the identity matrix. A special
case of GLS is weighted least squares (WLS), discussed in Section 18.5, in which the
conditional covariance matrix is diagonal and the $i$th diagonal element is a function of $X_i$.
Like WLS, GLS transforms the regression model so that the errors of the transformed
model satisfy the Gauss-Markov conditions. The GLS estimator is the OLS
estimator of the coefficients in the transformed model.


¹ The GLS estimator was introduced in Section 16.5 in the context of distributed lag time series regression.
The presentation here is a self-contained mathematical treatment of GLS that can be read independently
of Section 16.5, but reading that section first will help to make these ideas more concrete.


Key Concept 19.4: The GLS Assumptions

In the linear regression model $Y = X\beta + U$, the GLS assumptions are

1. $E(U|X) = 0_n$;
2. $E(UU'|X) = \Omega(X)$, where $\Omega(X)$ is an $n \times n$ positive definite matrix that can
depend on $X$;
3. $X_i$ and $u_i$ satisfy suitable moment conditions; and
4. $X$ has full column rank (there is no perfect multicollinearity).


The GLS Assumptions 
There are four assumptions under which GLS is valid. The first GLS assumption is
that $u_i$ has a mean of 0, conditional on $X_1, \ldots, X_n$; that is,

$E(U|X) = 0_n$. (19.40)


This assumption is implied by the first two least squares assumptions in Key Concept
19.1; that is, if $E(u_i|X_i) = 0$ and $(X_i, Y_i)$, $i = 1, \ldots, n$, are i.i.d., then $E(U|X) = 0_n$. In
GLS, however, we will not want to maintain the i.i.d. assumption; after all, one purpose
of GLS is to handle errors that are correlated across observations. We discuss
the significance of the assumption in Equation (19.40) after introducing the GLS
estimator.

The second GLS assumption is that the conditional covariance matrix of $U$ given
$X$ is some function of $X$:


$E(UU'|X) = \Omega(X)$, (19.41)


where $\Omega(X)$ is an $n \times n$ positive definite matrix-valued function of $X$.

There are two main applications of GLS that are covered by this assumption. The
first is independent sampling with heteroskedastic errors, in which case $\Omega(X)$ is a
diagonal matrix with diagonal element $\lambda h(X_i)$, where $\lambda$ is a constant and $h$ is a
function. In this case, discussed in Section 18.5, GLS is WLS.

The second application is to homoskedastic errors that are serially correlated. In
practice, in this case a model is developed for the serial correlation. For example, one
model is that the error term is correlated with only its neighbor, so
$\mathrm{corr}(u_i, u_{i-1}) = \rho \ne 0$ but $\mathrm{corr}(u_i, u_j) = 0$ if $|i - j| \ge 2$. In this case, $\Omega(X)$ has $\sigma_u^2$
as its diagonal element, $\rho\sigma_u^2$ in the first off-diagonal, and zeros elsewhere. Thus $\Omega(X)$
does not depend on $X$, $\Omega_{ii} = \sigma_u^2$, $\Omega_{ij} = \rho\sigma_u^2$ for $|i - j| = 1$, and $\Omega_{ij} = 0$ for
$|i - j| > 1$. Other models for serial correlation, including the first-order autoregressive
model, are discussed further in the context of GLS in Section 16.5 (also see
Exercise 19.8).

One assumption that has appeared on all previous lists of least squares assumptions
for cross-sectional data is that $X_i$ and $u_i$ have nonzero finite fourth moments. In
the case of GLS, the specific moment assumptions needed to prove asymptotic results
depend on the nature of the function $\Omega(X)$, whether $\Omega(X)$ is known or estimated,
and the statistic under consideration (the GLS estimator, t-statistic, etc.). Because the
assumptions are case- and model-specific, we do not present specific moment
assumptions here, and the discussion of the large-sample properties of GLS assumes
that such moment conditions apply for the relevant case at hand. For completeness,
as the third GLS assumption, $X_i$ and $u_i$ are simply assumed to satisfy suitable moment
conditions.

The fourth GLS assumption is that $X$ has full column rank; that is, the regressors
are not perfectly multicollinear.

The GLS assumptions are summarized in Key Concept 19.4.

We consider GLS estimation in two cases. In the first case, $\Omega(X)$ is known. In
the second case, the functional form of $\Omega(X)$ is known up to some parameters that
can be estimated. To simplify notation, we refer to the function $\Omega(X)$ as the matrix
$\Omega$, so the dependence of $\Omega$ on $X$ is implicit.


GLS When $\Omega$ Is Known


When $\Omega$ is known, the GLS estimator uses $\Omega$ to transform the regression model to
one with errors that satisfy the Gauss-Markov conditions. Specifically, let $F$ be a
matrix square root of $\Omega^{-1}$; that is, let $F$ be a matrix that satisfies $F'F = \Omega^{-1}$ (see
Appendix 19.1). A property of $F$ is that $F\Omega F' = I_n$. Now premultiply both sides of
Equation (19.3) by $F$ to obtain


$\tilde{Y} = \tilde{X}\beta + \tilde{U}$, (19.42)


where $\tilde{Y} = FY$, $\tilde{X} = FX$, and $\tilde{U} = FU$.

The key insight of GLS is that, under the four GLS assumptions, the Gauss-Markov
assumptions hold for the transformed regression in Equation (19.42). That is, by
transforming all the variables by the matrix square root of the inverse of $\Omega$, the
regression errors in the transformed regression have a conditional mean of 0
and a covariance matrix that equals the identity matrix. To show this mathematically,
first note that $E(\tilde{U}|\tilde{X}) = E(FU|FX) = FE(U|FX) = 0_n$ by the first GLS
assumption [Equation (19.40)]. In addition, $E(\tilde{U}\tilde{U}'|\tilde{X}) = E[(FU)(FU)'|FX] =
FE(UU'|FX)F' = F\Omega F' = I_n$, where the second equality follows because
$(FU)' = U'F'$ and the final equality follows from the definition of $F$. It follows that
the transformed regression model in Equation (19.42) satisfies the Gauss-Markov
conditions in Key Concept 19.3.

The GLS estimator, $\tilde{\beta}^{GLS}$, is the OLS estimator of $\beta$ in Equation (19.42); that is,
$\tilde{\beta}^{GLS} = (\tilde{X}'\tilde{X})^{-1}(\tilde{X}'\tilde{Y})$. Because the transformed regression model satisfies the
Gauss-Markov conditions, the GLS estimator is the best conditionally unbiased
estimator that is linear in $\tilde{Y}$. But because $\tilde{Y} = FY$ and $F$ is (here) assumed to be
known and because $F$ is invertible (because $\Omega$ is positive definite), the class of
estimators that are linear in $\tilde{Y}$ is the same as the class of estimators that are linear in $Y$.
Thus the OLS estimator of $\beta$ in Equation (19.42) is also the best conditionally unbiased
estimator among estimators that are linear in $Y$. In other words, under the GLS
assumptions, the GLS estimator is BLUE.

The GLS estimator can be expressed directly in terms of $\Omega$, so in principle there
is no need to compute the square root matrix $F$. Because $\tilde{X} = FX$ and
$\tilde{Y} = FY$, $\tilde{\beta}^{GLS} = (X'F'FX)^{-1}(X'F'FY)$. But $F'F = \Omega^{-1}$, so


$\tilde{\beta}^{GLS} = (X'\Omega^{-1}X)^{-1}(X'\Omega^{-1}Y)$. (19.43)


In practice, $\Omega$ is typically unknown, so the GLS estimator in Equation (19.43) typically
cannot be computed and thus is sometimes called the infeasible GLS estimator.
If, however, $\Omega$ has a known functional form but the parameters of that function are
unknown, then $\Omega$ can be estimated, and a feasible version of the GLS estimator can
be computed.
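
The two routes to the GLS estimator described above, OLS on the transformed model in Equation (19.42) and the direct formula in Equation (19.43), can be compared numerically. In the sketch below, the covariance matrix has the neighbor-correlation structure from the earlier example, and the matrix square root of the inverse is taken to be the inverse of the Cholesky factor, which is one valid choice of $F$ among many. The data, parameter values, and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Known Omega: sigma2_u on the diagonal, rho * sigma2_u on the first off-diagonals
sigma2_u, rho = 1.0, 0.4
off = rho * sigma2_u * np.ones(n - 1)
Omega = sigma2_u * np.eye(n) + np.diag(off, 1) + np.diag(off, -1)

# Draw errors with covariance Omega: U = C z, where Omega = C C' (Cholesky factor)
C = np.linalg.cholesky(Omega)
U = C @ rng.normal(size=n)
Y = X @ beta + U

# One choice of F with F'F = Omega^{-1}: F = C^{-1}
F = np.linalg.inv(C)

# Route 1: OLS on the transformed model, Equation (19.42)
X_t, Y_t = F @ X, F @ Y
beta_gls_transformed = np.linalg.solve(X_t.T @ X_t, X_t.T @ Y_t)

# Route 2: direct formula, Equation (19.43), without forming F
Omega_inv_X = np.linalg.solve(Omega, X)            # Omega^{-1} X
beta_gls_direct = np.linalg.solve(X.T @ Omega_inv_X, Omega_inv_X.T @ Y)
```

Using `np.linalg.solve` in the second route avoids forming the inverse of the covariance matrix explicitly, which is usually preferable numerically.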


GLS When $\Omega$ Contains Unknown Parameters


If $\Omega$ is a known function of some parameters that in turn can be estimated, then
these estimated parameters can be used to calculate an estimator of the covariance
matrix $\Omega$. For example, consider the time series application discussed following
Equation (19.41), in which $\Omega(X)$ does not depend on $X$, $\Omega_{ii} = \sigma_u^2$, $\Omega_{ij} = \rho\sigma_u^2$ for
$|i - j| = 1$, and $\Omega_{ij} = 0$ for $|i - j| > 1$. Then $\Omega$ has two unknown parameters, $\sigma_u^2$
and $\rho$. These parameters can be estimated using the residuals from a preliminary
OLS regression; specifically, $\sigma_u^2$ can be estimated by $s_{\hat{u}}^2$, and $\rho$ can be estimated by the
sample correlation between all neighboring pairs of OLS residuals. These estimated
parameters can in turn be used to compute an estimator of $\Omega$, $\hat{\Omega}$.

In general, suppose that you have an estimator $\hat{\Omega}$ of $\Omega$. Then the GLS estimator
based on $\hat{\Omega}$ is


$\hat{\beta}^{GLS} = (X'\hat{\Omega}^{-1}X)^{-1}(X'\hat{\Omega}^{-1}Y)$. (19.44)


The GLS estimator in Equation (19.44) is sometimes called the feasible GLS estimator
because it can be computed if the covariance matrix contains some unknown
parameters that can be estimated.
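
The feasible GLS steps for the neighbor-correlation example can be sketched as follows. The errors are simulated as $u_i = e_i + \theta e_{i-1}$, a construction assumed here because it produces correlation only between neighboring errors; the value of `theta`, the sample size, and the variable names are illustrative assumptions, and the sketch does not check that the estimated covariance matrix is positive definite.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 300, 1
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Errors correlated only with their neighbors (illustrative construction):
# corr(u_i, u_{i-1}) = theta / (1 + theta^2), and zero at longer lags.
theta = 0.3
e = rng.normal(size=n + 1)
u = e[1:] + theta * e[:-1]
Y = X @ beta + u

# Step 1: preliminary OLS regression and residuals
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
u_hat = Y - X @ beta_ols

# Step 2: estimate sigma_u^2 by s^2 and rho by the sample correlation
# between neighboring pairs of OLS residuals
sigma2_hat = u_hat @ u_hat / (n - k - 1)
rho_hat = np.corrcoef(u_hat[1:], u_hat[:-1])[0, 1]

# Step 3: build the estimated covariance matrix
off = rho_hat * sigma2_hat * np.ones(n - 1)
Omega_hat = sigma2_hat * np.eye(n) + np.diag(off, 1) + np.diag(off, -1)

# Step 4: feasible GLS, Equation (19.44)
Oinv_X = np.linalg.solve(Omega_hat, X)             # Omega_hat^{-1} X
beta_fgls = np.linalg.solve(X.T @ Oinv_X, Oinv_X.T @ Y)
```

If the estimated correlation were zero, the estimated covariance matrix would be proportional to the identity matrix and the feasible GLS estimator would coincide with OLS.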


The Conditional Mean Zero Assumption and GLS 


For the OLS estimator to be consistent, the first least squares assumption must hold;
that is, $E(u_i | X_i)$ must be 0. In contrast, the first GLS assumption is that
$E(u_i | X_1, \ldots, X_n) = 0$. In other words, the first OLS assumption is that the error for
the $i$th observation has a conditional mean of 0 given the values of the regressors for
that observation, whereas the first GLS assumption is that $u_i$ has a conditional mean
of 0 given the values of the regressors for all observations.

As discussed in Section 19.1, the assumptions that $E(u_i | X_i) = 0$ and that sampling
is i.i.d. together imply that $E(u_i | X_1, \ldots, X_n) = 0$. Thus, when sampling is i.i.d.,
so that GLS is WLS, the first GLS assumption is implied by the first least squares
assumption in Key Concept 19.1.

When sampling is not i.i.d., however, the first GLS assumption is not implied by
the assumption that $E(u_i | X_i) = 0$; that is, the first GLS assumption is stronger.
Although the distinction between these two conditions might seem slight, it can be
very important in applications to time series data. This distinction is discussed in
Section 16.5 in the context of whether the regressor is "past and present" exogenous
or "strictly" exogenous; the assumption that $E(u_i | X_1, \ldots, X_n) = 0$ corresponds to
strict exogeneity. Here, we discuss this distinction at a more general level using matrix
notation. To do so, we focus on the case that $U$ is homoskedastic, $\Omega$ is known, and
$\Omega$ has nonzero off-diagonal elements.


The role of the first GLS assumption. To see the source of the difference between 
these assumptions, it is useful to contrast the consistency arguments for GLS and OLS. 

We first sketch the argument for the consistency of the GLS estimator in Equation (19.43).
Substituting Equation (19.3) into Equation (19.43), we have
$\tilde{\beta}^{GLS} = \beta + (X'\Omega^{-1}X/n)^{-1}(X'\Omega^{-1}U/n)$. Under the first GLS
assumption, $E(X'\Omega^{-1}U) = E[X'\Omega^{-1}E(U|X)] = 0_{k+1}$. If in addition the variance
of $X'\Omega^{-1}U/n$ tends to 0 and $X'\Omega^{-1}X/n \xrightarrow{p} \tilde{Q}$, where $\tilde{Q}$ is some
invertible matrix, then $\tilde{\beta}^{GLS} \xrightarrow{p} \beta$.
Critically, when $\Omega$ has off-diagonal elements, the term
$X'\Omega^{-1}U = \sum_{i=1}^{n}\sum_{j=1}^{n} X_i(\Omega^{-1})_{ij}u_j$ involves products of $X_i$ and $u_j$
for different $i, j$ pairs, where $(\Omega^{-1})_{ij}$ denotes the $(i, j)$ element of $\Omega^{-1}$.
Thus, for $X'\Omega^{-1}U$ to have a mean of 0, it is not enough that $E(u_i | X_i) = 0$; rather,
$E(u_i | X_j)$ must equal 0 for all $i, j$ pairs corresponding to nonzero values of
$(\Omega^{-1})_{ij}$. Depending on the covariance structure of the errors, only some of or all
the elements of $(\Omega^{-1})_{ij}$ might be nonzero. For example, if $u_i$ follows a first-order
autoregression (as discussed in Section 16.5), the only nonzero elements $(\Omega^{-1})_{ij}$ are
those for which $|i - j| \le 1$. In general, however, all the elements of $\Omega^{-1}$ can be
nonzero, so, in general, for $X'\Omega^{-1}U/n \xrightarrow{p} 0_{(k+1) \times 1}$ (and thus for
$\tilde{\beta}^{GLS}$ to be consistent), we need that $E(U|X) = 0_n$; that is, the first GLS
assumption must hold.
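
The claim about first-order autoregressive errors can be checked numerically. The sketch below assumes an AR(1) error with coefficient `phi` and unit innovation variance, so that $\Omega_{ij} = \phi^{|i-j|}/(1 - \phi^2)$; the values of `n` and `phi` are arbitrary choices for illustration. It inverts $\Omega$ and measures the largest entry more than one step off the diagonal.

```python
import numpy as np

n, phi = 6, 0.5
idx = np.arange(n)
lag = np.abs(idx[:, None] - idx[None, :])

# AR(1) covariance matrix with unit innovation variance.
omega = phi ** lag / (1.0 - phi**2)
omega_inv = np.linalg.inv(omega)

# Largest absolute entry of Omega^{-1} with |i - j| > 1 (zero up to rounding).
max_far = np.max(np.abs(omega_inv[lag > 1]))

# Entries next to the diagonal are nonzero, so X_i is multiplied by u_{i-1} and u_{i+1}.
first_off_diagonal = omega_inv[0, 1]
```

Even in this sparse case, $X'\Omega^{-1}U$ pairs each $X_i$ with the neighboring errors, so $E(u_i | X_i) = 0$ alone does not give the product a mean of 0.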

In contrast, recall the argument that the OLS estimator is consistent. Rewrite
Equation (19.14) as $\hat{\beta} = \beta + (X'X/n)^{-1}\frac{1}{n}\sum_{i=1}^{n} X_i u_i$. If
$E(u_i | X_i) = 0$, then the term $\frac{1}{n}\sum_{i=1}^{n} X_i u_i$ has mean 0, and if this term has
a variance that tends to 0, it converges in probability to 0. If in addition
$X'X/n \xrightarrow{p} Q_X$, then $\hat{\beta} \xrightarrow{p} \beta$.


Is the first GLS assumption restrictive? The first GLS assumption requires that the
errors for the $i$th observation be uncorrelated with the regressors for all other
observations. This assumption is dubious in some time series applications. This issue is
discussed in Section 16.6 in the context of an empirical example, the relationship
between the change in the price of a contract for future delivery of frozen orange
concentrate and the weather in Florida. As explained there, the error term in the
regression of price changes on the weather is plausibly uncorrelated with current and
past values of the weather, so the first OLS assumption holds. However, this error
term is plausibly correlated with future values of the weather, so the first GLS
assumption does not hold.

This example illustrates a general phenomenon in economic time series data that 
arises when the value of a variable today is set in part based on expectations of the future: 
Those future expectations typically imply that the error term today depends on a forecast 
of the regressor tomorrow, which in turn is correlated with the actual value of the regres- 
sor tomorrow. For this reason, the first GLS assumption is, in fact, much stronger than the 
first OLS assumption. Accordingly, in some applications with economic time series data, 
the GLS estimator is not consistent even though the OLS estimator is. 


Instrumental Variables and Generalized 
Method of Moments Estimation 


This section provides an introduction to the theory of instrumental variables (IV) 
estimation and the asymptotic distribution of IV estimators. It is assumed throughout 
that the IV regression assumptions in Key Concepts 12.3 and 12.4 hold and, more- 
over, that the instruments are strong. These assumptions apply to cross-sectional data 
with i.i.d. observations. Under certain conditions, the results derived in this section 
are applicable to time series data as well, and the extension to time series data is 
briefly discussed at the end of this section. All asymptotic results in this section are 
developed under the assumption of strong instruments. 

This section begins by presenting the IV regression model and the two stage least 
squares (TSLS) estimator and its asymptotic distribution in the general case of het- 
eroskedasticity, all in matrix form. It is next shown that, in the special case of homo- 
skedasticity, the TSLS estimator is asymptotically efficient among the class of IV 
estimators in which the instruments are linear combinations of the exogenous vari- 
ables. Moreover, the J-statistic has an asymptotic chi-squared distribution in which 
the degrees of freedom equals the number of overidentifying restrictions. This sec- 
tion concludes with a discussion of efficient IV estimation and the test of overiden- 
tifying restrictions when the errors are heteroskedastic—a situation in which the 
efficient IV estimator is known as the efficient generalized method of moments 
(GMM) estimator [Hansen (1982)].


The IV Estimator in Matrix Form 

In this section, we let $X$ denote the $n \times (k + r + 1)$ matrix of the regressors in the
equation of interest, so $X$ contains the included endogenous regressors (the X's in Key
Concept 12.1) and the included exogenous regressors (the W's in Key Concept 12.1).
That is, in the notation of Key Concept 12.1, the $i$th row of $X$ is
$X_i' = (1 \;\; X_{1i} \;\; X_{2i} \;\; \cdots \;\; X_{ki} \;\; W_{1i} \;\; W_{2i} \;\; \cdots \;\; W_{ri})$.
Also, let $Z$ denote the $n \times (m + r + 1)$ matrix of all the exogenous regressors, both
those included in the equation of interest (the W's) and those excluded from the equation
of interest (the instruments). That is, in the notation of Key Concept 12.1, the $i$th row
of $Z$ is $Z_i' = (1 \;\; Z_{1i} \;\; Z_{2i} \;\; \cdots \;\; Z_{mi} \;\; W_{1i} \;\; W_{2i} \;\; \cdots \;\; W_{ri})$.

With this notation, the IV regression model of Key Concept 12.1, written in
matrix form, is

$$Y = X\beta + U, \tag{19.45}$$

where $U$ is the $n \times 1$ vector of errors in the equation of interest, with $i$th element $u_i$.
The matrix $Z$ consists of all the exogenous regressors, so under the IV regression
assumptions in Key Concept 12.4,

$$E(Z_i u_i) = 0 \quad \text{(instrument exogeneity)}. \tag{19.46}$$


Because there are k included endogenous regressors, the first stage regression con- 
sists of k equations. 


The TSLS estimator. The TSLS estimator is the instrumental variables estimator in
which the instruments are the predicted values of $X$ based on OLS estimation of the
first-stage regression. Let $\hat{X}$ denote this matrix of predicted values, so that the $i$th
row of $\hat{X}$ is $(1 \;\; \hat{X}_{1i} \;\; \hat{X}_{2i} \;\; \cdots \;\; \hat{X}_{ki} \;\; W_{1i} \;\; W_{2i} \;\; \cdots \;\; W_{ri})$,
where $\hat{X}_{1i}$ is the predicted value from the regression of $X_{1i}$ on $Z$ and so forth.
Because the W's are contained in $Z$, the predicted value from a regression of $W_{1i}$ on $Z$
is just $W_{1i}$ and so forth, so $\hat{X} = P_Z X$, where $P_Z = Z(Z'Z)^{-1}Z'$ [see Equation (19.27)].
Accordingly, the TSLS estimator is

$$\hat{\beta}^{TSLS} = (\hat{X}'\hat{X})^{-1}\hat{X}'Y. \tag{19.47}$$

Because $\hat{X} = P_Z X$, $\hat{X}'\hat{X} = X'P_Z X$, and $\hat{X}'Y = X'P_Z Y$, the TSLS estimator
can be rewritten as

$$\hat{\beta}^{TSLS} = (X'P_Z X)^{-1}X'P_Z Y. \tag{19.48}$$
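
As an illustration of Equations (19.47) and (19.48), the following sketch computes TSLS both as a literal two-stage regression and through the projection formula, and the two agree up to rounding. The function names and the simulated design (one endogenous regressor, two instruments, one included exogenous regressor) are assumptions made for the example.

```python
import numpy as np

def tsls(X, Z, Y):
    """TSLS via Equation (19.48): (X' P_Z X)^{-1} X' P_Z Y."""
    # P_Z X is computed as Z (Z'Z)^{-1} Z'X, without forming the n x n matrix P_Z.
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    # X_hat' X equals X' P_Z X, and X_hat' Y equals X' P_Z Y.
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ Y)

def tsls_two_stage(X, Z, Y):
    """TSLS via Equation (19.47): regress X on Z, then Y on the fitted values."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, Y, rcond=None)[0]

# Simulated data (hypothetical design).
rng = np.random.default_rng(1)
n = 500
w = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)
v = rng.normal(size=n)
u = 0.8 * v + rng.normal(size=n)              # u is correlated with v, so x is endogenous
x = 0.5 + z1 + 0.5 * z2 + 0.3 * w + v
Y = 1.0 + 2.0 * x - 1.0 * w + u
X = np.column_stack([np.ones(n), x, w])       # rows (1, X_1i, W_1i)
Z = np.column_stack([np.ones(n), z1, z2, w])  # rows (1, Z_1i, Z_2i, W_1i)

b_proj = tsls(X, Z, Y)
b_two_stage = tsls_two_stage(X, Z, Y)
```

The columns of `X` for the constant and `w` are reproduced exactly by the first stage, which is the statement in the text that the predicted value of $W_{1i}$ from a regression on $Z$ is just $W_{1i}$.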


Asymptotic Distribution of the TSLS Estimator 


Substituting Equation (19.45) into Equation (19.48), rearranging, and multiplying by
$\sqrt{n}$ yields the expression for the centered and scaled TSLS estimator:

$$\sqrt{n}(\hat{\beta}^{TSLS} - \beta) = \left(\frac{X'P_Z X}{n}\right)^{-1}\frac{X'P_Z U}{\sqrt{n}}
= \left[\frac{X'Z}{n}\left(\frac{Z'Z}{n}\right)^{-1}\frac{Z'X}{n}\right]^{-1}\frac{X'Z}{n}\left(\frac{Z'Z}{n}\right)^{-1}\frac{Z'U}{\sqrt{n}}, \tag{19.49}$$

where the second equality uses the definition of $P_Z$. Under the IV regression assumptions,
$X'Z/n \xrightarrow{p} Q_{XZ}$ and $Z'Z/n \xrightarrow{p} Q_{ZZ}$, where $Q_{XZ} = E(X_i Z_i')$ and
$Q_{ZZ} = E(Z_i Z_i')$. In addition, under the IV regression assumptions, $Z_i u_i$ is i.i.d. with
mean 0 [Equation (19.46)] and a positive definite covariance matrix, so its sum,
divided by $\sqrt{n}$, satisfies the conditions of the multivariate central limit theorem
(Key Concept 19.2) and

$$Z'U/\sqrt{n} \xrightarrow{d} \Psi_{ZU}, \text{ where } \Psi_{ZU} \sim N(0, H), \; H = E(Z_i Z_i' u_i^2), \tag{19.50}$$

and $\Psi_{ZU}$ is $(m + r + 1) \times 1$.

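Application of Equation (19.50) and of the limits $X'Z/n \xrightarrow{p} Q_{XZ}$ and
$Z'Z/n \xrightarrow{p} Q_{ZZ}$ to Equation (19.49) yields the result that, under the IV regression
assumptions, the TSLS estimator is asymptotically normally distributed:

$$\sqrt{n}(\hat{\beta}^{TSLS} - \beta) \xrightarrow{d} (Q_{XZ}Q_{ZZ}^{-1}Q_{ZX})^{-1}Q_{XZ}Q_{ZZ}^{-1}\Psi_{ZU} \sim N(0, \Sigma^{TSLS}), \tag{19.51}$$

where

$$\Sigma^{TSLS} = (Q_{XZ}Q_{ZZ}^{-1}Q_{ZX})^{-1}Q_{XZ}Q_{ZZ}^{-1}HQ_{ZZ}^{-1}Q_{ZX}(Q_{XZ}Q_{ZZ}^{-1}Q_{ZX})^{-1}, \tag{19.52}$$

where $H$ is defined in Equation (19.50).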


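Standard errors for TSLS. The formula in Equation (19.52) is daunting. Nevertheless,
it provides a way to estimate $\Sigma^{TSLS}$ by substituting sample moments for the
population moments. The resulting variance estimator is

$$\hat{\Sigma}^{TSLS} = (\hat{Q}_{XZ}\hat{Q}_{ZZ}^{-1}\hat{Q}_{ZX})^{-1}\hat{Q}_{XZ}\hat{Q}_{ZZ}^{-1}\hat{H}\hat{Q}_{ZZ}^{-1}\hat{Q}_{ZX}(\hat{Q}_{XZ}\hat{Q}_{ZZ}^{-1}\hat{Q}_{ZX})^{-1}, \tag{19.53}$$

where $\hat{Q}_{XZ} = X'Z/n$, $\hat{Q}_{ZZ} = Z'Z/n$, $\hat{Q}_{ZX} = Z'X/n$, and

$$\hat{H} = \frac{1}{n}\sum_{i=1}^{n} Z_i Z_i' \hat{u}_i^2, \text{ where } \hat{U} = Y - X\hat{\beta}^{TSLS}, \tag{19.54}$$

so that $\hat{U}$ is the vector of TSLS residuals, and where $\hat{u}_i$ is the $i$th element of that
vector (the TSLS residual for the $i$th observation).

The TSLS standard errors are the square roots of the diagonal elements of
$\hat{\Sigma}^{TSLS}/n$.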
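
A minimal sketch of Equations (19.53) and (19.54) follows. The function name `tsls_with_robust_se` and the simulated heteroskedastic data are assumptions for illustration; the sample moments are named after the symbols in the text.

```python
import numpy as np

def tsls_with_robust_se(X, Z, Y):
    """TSLS coefficients and heteroskedasticity-robust standard errors."""
    n = X.shape[0]
    Q_xz = X.T @ Z / n
    Q_zz = Z.T @ Z / n
    G = Q_xz @ np.linalg.inv(Q_zz)             # Q_XZ Q_ZZ^{-1}
    bread = np.linalg.inv(G @ Q_xz.T)          # (Q_XZ Q_ZZ^{-1} Q_ZX)^{-1}
    beta = bread @ G @ (Z.T @ Y / n)           # Equation (19.48) in sample-moment form
    u_hat = Y - X @ beta                       # TSLS residuals
    Zu = Z * u_hat[:, None]
    H_hat = Zu.T @ Zu / n                      # Equation (19.54)
    Sigma_hat = bread @ G @ H_hat @ G.T @ bread  # Equation (19.53)
    se = np.sqrt(np.diag(Sigma_hat) / n)
    return beta, se

# Simulated data (hypothetical): the error variance depends on the first instrument.
rng = np.random.default_rng(4)
n = 1000
z = rng.normal(size=(n, 2))
v = rng.normal(size=n)
u = (0.8 * v + rng.normal(size=n)) * np.sqrt(0.5 + z[:, 0] ** 2)
x = z @ np.array([1.0, 0.5]) + v
Y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
beta_hat, se_hat = tsls_with_robust_se(X, Z, Y)
```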


Properties of TSLS When the Errors Are Homoskedastic 


If the errors are homoskedastic, then the TSLS estimator is asymptotically efficient 
among the class of IV estimators in which the instruments are linear combinations 
of the rows of Z. This result is the IV counterpart to the Gauss—Markov theorem and 
constitutes an important justification for using TSLS. 


The TSLS distribution under homoskedasticity. If the errors are homoskedastic, that is,
if $E(u_i^2 | Z_i) = \sigma_u^2$, then
$H = E(Z_i Z_i' u_i^2) = E[E(Z_i Z_i' u_i^2 | Z_i)] = E[Z_i Z_i' E(u_i^2 | Z_i)] = Q_{ZZ}\sigma_u^2$.
In this case, the variance of the asymptotic distribution of the TSLS estimator
in Equation (19.52) simplifies to

$$\Sigma^{TSLS} = (Q_{XZ}Q_{ZZ}^{-1}Q_{ZX})^{-1}\sigma_u^2 \quad \text{(homoskedasticity only)}. \tag{19.55}$$

The homoskedasticity-only estimator of the TSLS variance matrix is

$$\tilde{\Sigma}^{TSLS} = (\hat{Q}_{XZ}\hat{Q}_{ZZ}^{-1}\hat{Q}_{ZX})^{-1}\hat{\sigma}_u^2, \text{ where } \hat{\sigma}_u^2 = \frac{\hat{U}'\hat{U}}{n - k - r - 1} \quad \text{(homoskedasticity only)}, \tag{19.56}$$

and the homoskedasticity-only TSLS standard errors are the square roots of the
diagonal elements of $\tilde{\Sigma}^{TSLS}/n$.


The class of IV estimators that use linear combinations of Z. The class of IV estimators
that use linear combinations of $Z$ as instruments can be generated in two equivalent
ways. Both start with the same moment equation: Under the assumption of
instrument exogeneity, the errors $U = Y - X\beta$ are uncorrelated with the exogenous
regressors; that is, at the true value of $\beta$, Equation (19.46) implies that

$$E[(Y - X\beta)'Z] = 0. \tag{19.57}$$

Equation (19.57) constitutes a system of $m + r + 1$ equations involving the
$k + r + 1$ unknown elements of $\beta$. When $m > k$, these equations are redundant in
the sense that all are satisfied at the true value of $\beta$. When these population moments
are replaced by their sample moments, the system of equations $(Y - Xb)'Z = 0$ can
be solved for $b$ when there is exact identification ($m = k$). This value of $b$ is the IV
estimator of $\beta$. However, when there is overidentification ($m > k$), the equations in
the system cannot be simultaneously satisfied by the same value of $b$ because of
sampling variation (there are more equations than unknowns), and, in general, this
system does not have a solution.

The first approach to the problem of estimating $\beta$ when there is overidentification
is to trade off the desire to satisfy each equation by minimizing a quadratic form
involving all the equations. Specifically, let $A$ be an $(m + r + 1) \times (m + r + 1)$
symmetric positive semidefinite weight matrix, and let $\hat{\beta}_A^{IV}$ denote the estimator
that minimizes

$$\min_b \, (Y - Xb)'ZAZ'(Y - Xb). \tag{19.58}$$

The solution to this minimization problem is found by taking the derivative of the
objective function with respect to $b$, setting the result equal to 0, and rearranging.
Doing so yields $\hat{\beta}_A^{IV}$, the IV estimator based on the weight matrix $A$:

$$\hat{\beta}_A^{IV} = (X'ZAZ'X)^{-1}X'ZAZ'Y. \tag{19.59}$$
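
The step from the minimization problem (19.58) to the estimator (19.59) can be written out as follows. This is a sketch of the standard first-order-condition argument; it uses the symmetry of $A$ and assumes that $X'ZAZ'X$ is invertible.

```latex
\begin{aligned}
\frac{\partial}{\partial b}\,(Y - Xb)'ZAZ'(Y - Xb) &= -2X'ZAZ'(Y - Xb) = 0 \\
\Longrightarrow \quad X'ZAZ'Xb &= X'ZAZ'Y \\
\Longrightarrow \quad b &= (X'ZAZ'X)^{-1}X'ZAZ'Y .
\end{aligned}
```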
Comparison of Equations (19.59) and (19.48) shows that the TSLS estimator is the
IV estimator with $A = (Z'Z)^{-1}$. That is, TSLS is the solution of the minimization
problem in Equation (19.58) with $A = (Z'Z)^{-1}$.

The calculations leading to Equations (19.51) and (19.52), applied to $\hat{\beta}_A^{IV}$, show that
$\sqrt{n}(\hat{\beta}_A^{IV} - \beta) \xrightarrow{d} N(0, \Sigma_A^{IV})$, where

$$\Sigma_A^{IV} = (Q_{XZ}AQ_{ZX})^{-1}Q_{XZ}AHAQ_{ZX}(Q_{XZ}AQ_{ZX})^{-1}. \tag{19.60}$$


The second way to generate the class of IV estimators that use linear combinations
of $Z$ is to consider IV estimators in which the instruments are $ZB$, where $B$ is
an $(m + r + 1) \times (k + r + 1)$ matrix with full column rank. Then the system of
$(k + r + 1)$ equations, $(Y - Xb)'ZB = 0$, can be solved uniquely for the $(k + r + 1)$
unknown elements of $b$. Solving these equations for $b$ yields
$\hat{\beta}^{IV} = (B'Z'X)^{-1}(B'Z'Y)$, and substitution of $B = AZ'X$ into this expression
yields Equation (19.59).

Thus the two approaches to defining IV estimators that are linear combinations 
of the instruments yield the same family of IV estimators. It is conventional to work 
with the first approach, in which the IV estimator solves the quadratic minimization 
problem in Equation (19.58), and that is the approach taken here. 


Asymptotic efficiency of TSLS under homoskedasticity. If the errors are homoskedastic,
then $H = Q_{ZZ}\sigma_u^2$, and the expression for $\Sigma_A^{IV}$ in Equation (19.60) becomes

$$\Sigma_A^{IV} = (Q_{XZ}AQ_{ZX})^{-1}Q_{XZ}AQ_{ZZ}AQ_{ZX}(Q_{XZ}AQ_{ZX})^{-1}\sigma_u^2. \tag{19.61}$$

To show that TSLS is asymptotically efficient among the class of estimators that are
linear combinations of $Z$ when the errors are homoskedastic, we need to show that,
under homoskedasticity,

$$c'\Sigma^{TSLS}c \le c'\Sigma_A^{IV}c \tag{19.62}$$

for all positive semidefinite matrices $A$ and all $(k + r + 1) \times 1$ vectors $c$, where
$\Sigma^{TSLS} = (Q_{XZ}Q_{ZZ}^{-1}Q_{ZX})^{-1}\sigma_u^2$ [Equation (19.55)]. The inequality (19.62), which is
proven in Appendix 19.6, is the same efficiency criterion as is used in the multivariate
Gauss-Markov theorem in Key Concept 19.3. Consequently, TSLS is the efficient IV
estimator under homoskedasticity among the class of estimators in which the
instruments are linear combinations of $Z$.
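
The inequality (19.62) can be spot-checked numerically, which is no substitute for the proof in Appendix 19.6 but helps make the statement concrete. In the sketch below, the population moments `Q_xz` and `Q_zz` and the error variance are arbitrary hypothetical values, and the weight matrices `A` are drawn at random and made positive definite so that the inverse in Equation (19.61) exists. For each draw, the smallest eigenvalue of $\Sigma_A^{IV} - \Sigma^{TSLS}$ is recorded; the inequality says it is never negative.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 2, 4                        # p = k + r + 1 regressors, q = m + r + 1 exogenous variables
Q_xz = rng.normal(size=(p, q))     # hypothetical E(X_i Z_i')
M = rng.normal(size=(q, q))
Q_zz = M @ M.T + q * np.eye(q)     # hypothetical E(Z_i Z_i'), positive definite
sigma2_u = 1.5

def sigma_iv(A):
    """Asymptotic variance in Equation (19.61) for weight matrix A."""
    bread = np.linalg.inv(Q_xz @ A @ Q_xz.T)
    return bread @ Q_xz @ A @ Q_zz @ A @ Q_xz.T @ bread * sigma2_u

# Equation (19.55).
sigma_tsls = np.linalg.inv(Q_xz @ np.linalg.inv(Q_zz) @ Q_xz.T) * sigma2_u

min_eigs = []
for _ in range(200):
    B = rng.normal(size=(q, q))
    A = B @ B.T + 0.1 * np.eye(q)  # random positive definite weight matrix
    diff = sigma_iv(A) - sigma_tsls
    min_eigs.append(np.linalg.eigvalsh((diff + diff.T) / 2).min())
worst = min(min_eigs)              # nonnegative up to rounding error
```

Setting `A` equal to the inverse of `Q_zz` reproduces `sigma_tsls` exactly, which is the weight matrix that attains the bound.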


The J-statistic under homoskedasticity. The J-statistic (Key Concept 12.6) tests the 
null hypothesis that all the overidentifying restrictions hold against the alternative 
that some or all of them do not hold. 

The idea of the J-statistic is that, if the overidentifying restrictions hold, $u_i$ will be
uncorrelated with the instruments, and thus a regression of $U$ on $Z$ will have population
regression coefficients that all equal 0. In practice, $U$ is not observed, but it can be
estimated by the TSLS residuals $\hat{U}$, so a regression of $\hat{U}$ on $Z$ should yield
statistically insignificant coefficients. Accordingly, the TSLS J-statistic is the
homoskedasticity-only F-statistic testing the hypothesis that the coefficients on $Z$ are
all 0, in the regression of $\hat{U}$ on $Z$, multiplied by $(m + r + 1)$ so that the F-statistic
is in its asymptotic chi-squared form.

An explicit formula for the J-statistic can be obtained using Equation (7.13)
for the homoskedasticity-only F-statistic. The unrestricted regression is the regression
of $\hat{U}$ on the $m + r + 1$ regressors $Z$, and the restricted regression has no
regressors. Thus, in the notation of Equation (7.13), $SSR_{unrestricted} = \hat{U}'M_Z\hat{U}$ and
$SSR_{restricted} = \hat{U}'\hat{U}$, so
$SSR_{restricted} - SSR_{unrestricted} = \hat{U}'\hat{U} - \hat{U}'M_Z\hat{U} = \hat{U}'P_Z\hat{U}$, and
the J-statistic is

$$J = \frac{\hat{U}'P_Z\hat{U}}{\hat{U}'M_Z\hat{U}/(n - m - r - 1)}. \tag{19.63}$$


The method for computing the J-statistic described in Key Concept 12.6 entails
testing only the hypothesis that the coefficients on the excluded instruments are 0.
Although these two methods have different computational steps, they produce
identical J-statistics (Exercise 19.14).

It is shown in Appendix 19.6 that, under the null hypothesis that $E(u_i Z_i) = 0$,

$$J \xrightarrow{d} \chi^2_{m-k}. \tag{19.64}$$
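
Equation (19.63) translates directly into code. The sketch below is an illustration under assumed names and simulated data (one endogenous regressor and three valid instruments, so $m - k = 2$); it avoids forming the $n \times n$ matrices $P_Z$ and $M_Z$ by using the fitted values from the regression of the residuals on $Z$.

```python
import numpy as np

def tsls_j_statistic(X, Z, Y):
    """Homoskedasticity-only TSLS J-statistic, Equation (19.63)."""
    n, q = Z.shape                           # q = m + r + 1
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    beta = np.linalg.solve(X_hat.T @ X, X_hat.T @ Y)
    u_hat = Y - X @ beta                     # TSLS residuals
    # P_Z u_hat is the fitted value from regressing the residuals on Z.
    Pu = Z @ np.linalg.solve(Z.T @ Z, Z.T @ u_hat)
    ssr_diff = u_hat @ Pu                    # U' P_Z U
    ssr_unrestricted = u_hat @ u_hat - ssr_diff  # U' M_Z U
    J = ssr_diff / (ssr_unrestricted / (n - q))
    dof = Z.shape[1] - X.shape[1]            # (m + r + 1) - (k + r + 1) = m - k
    return J, dof

# Simulated data (hypothetical): homoskedastic errors and valid instruments.
rng = np.random.default_rng(3)
n = 1000
z = rng.normal(size=(n, 3))
v = rng.normal(size=n)
u = 0.8 * v + rng.normal(size=n)
x = z @ np.array([1.0, 0.5, 0.5]) + v
Y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
J, dof = tsls_j_statistic(X, Z, Y)
```

Under the null hypothesis, `J` would be compared with a critical value of the chi-squared distribution with `dof` degrees of freedom.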


Generalized Method of Moments Estimation 
in Linear Models 


If the errors are heteroskedastic, then the TSLS estimator is no longer efficient 
among the class of IV estimators that use linear combinations of Z as instruments. 
The efficient estimator in this case is known as the efficient generalized method of 
moments (GMM) estimator. In addition, if the errors are heteroskedastic, then the 
J-statistic as defined in Equation (19.63) no longer has a chi-squared distribution. How- 
ever, an alternative formulation of the J-statistic, constructed using the efficient GMM 
estimator, does have a chi-squared distribution with m — k degrees of freedom. 

These results parallel the results for the estimation of the usual regression model 
with exogenous regressors and heteroskedastic errors: If the errors are heteroskedas- 
tic, then the OLS estimator is not efficient among estimators that are linear in Y (the 
Gauss—Markov conditions are not satisfied), and the homoskedasticity-only F-statistic 
no longer has an F distribution, even in large samples. In the regression model with 
exogenous regressors and heteroskedasticity, the efficient estimator is weighted least 
squares; in the IV regression model with heteroskedasticity, the efficient estimator 
uses a different weighting matrix than TSLS, and the resulting estimator is the efficient 
GMM estimator. 


GMM estimation. Generalized method of moments (GMM) estimation is a general 
method for the estimation of the parameters of linear or nonlinear models, in which 
the parameters are chosen to provide the best fit to multiple equations, each of which 
sets a sample moment to 0. These equations, which in the context of GMM are called 
moment conditions, typically cannot all be satisfied simultaneously. The GMM
estimator trades off the desire to satisfy each of the equations by minimizing a quadratic
objective function.

In the linear IV regression model with exogenous variables $Z$, the class of GMM
estimators consists of all the estimators that are solutions to the quadratic
minimization problem in Equation (19.58). Thus the class of GMM estimators based on the
full set of instruments $Z$ with different weight matrices $A$ is the same as the class of
IV estimators in which the instruments are linear combinations of $Z$. In the linear IV
regression model, GMM is just another name for the class of estimators we have
been studying, that is, estimators that solve Equation (19.58).


The asymptotically efficient GMM estimator. Among the class of GMM estimators, 
the efficient GMM estimator is the GMM estimator with the smallest asymptotic 
variance matrix [where the smallest variance matrix is defined as in Equation (19.62)]. 
Thus the result in Equation (19.62) can be restated as saying that TSLS is the efficient 
GMM estimator in the linear model when the errors are homoskedastic. 

To motivate the expression for the efficient GMM estimator when the errors are
heteroskedastic, recall that when the errors are homoskedastic, $H$ [the variance
matrix of $Z_i u_i$; see Equation (19.50)] equals $Q_{ZZ}\sigma_u^2$, and the asymptotically efficient
weight matrix is obtained by setting $A = (Z'Z)^{-1}$, which yields the TSLS estimator.
In large samples, using the weight matrix $A = (Z'Z)^{-1}$ is equivalent to using
$A = (Q_{ZZ}\sigma_u^2)^{-1} = H^{-1}$. This interpretation of the TSLS estimator suggests that, by
analogy, the efficient IV estimator under heteroskedasticity can be obtained by
setting $A = H^{-1}$ and solving

$$\min_b \, (Y - Xb)'ZH^{-1}Z'(Y - Xb). \tag{19.65}$$

This analogy is correct: The solution to the minimization problem in Equation (19.65)
is the efficient GMM estimator. Let $\tilde{\beta}^{Eff.GMM}$ denote the solution to the minimization
problem in Equation (19.65). By Equation (19.59), this estimator is

$$\tilde{\beta}^{Eff.GMM} = (X'ZH^{-1}Z'X)^{-1}X'ZH^{-1}Z'Y. \tag{19.66}$$

The asymptotic distribution of $\tilde{\beta}^{Eff.GMM}$ is obtained by substituting $A = H^{-1}$ into
Equation (19.60) and simplifying; thus

$$\sqrt{n}(\tilde{\beta}^{Eff.GMM} - \beta) \xrightarrow{d} N(0, \Sigma^{Eff.GMM}), \text{ where } \Sigma^{Eff.GMM} = (Q_{XZ}H^{-1}Q_{ZX})^{-1}. \tag{19.67}$$

The result that $\tilde{\beta}^{Eff.GMM}$ is the efficient GMM estimator is proven by showing that
$c'\Sigma^{Eff.GMM}c \le c'\Sigma_A^{IV}c$ for all vectors $c$, where $\Sigma_A^{IV}$ is given in Equation (19.60).
The proof of this result is given in Appendix 19.6.
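
The simplification that leads to Equation (19.67) is short enough to write out. Substituting $A = H^{-1}$ into Equation (19.60) gives the following, where the middle factor collapses because $H^{-1}HH^{-1} = H^{-1}$:

```latex
\begin{aligned}
\Sigma_A^{IV}\Big|_{A = H^{-1}}
  &= (Q_{XZ}H^{-1}Q_{ZX})^{-1}\,Q_{XZ}H^{-1}\,H\,H^{-1}Q_{ZX}\,(Q_{XZ}H^{-1}Q_{ZX})^{-1} \\
  &= (Q_{XZ}H^{-1}Q_{ZX})^{-1}\,(Q_{XZ}H^{-1}Q_{ZX})\,(Q_{XZ}H^{-1}Q_{ZX})^{-1} \\
  &= (Q_{XZ}H^{-1}Q_{ZX})^{-1} = \Sigma^{Eff.GMM}.
\end{aligned}
```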


Feasible efficient GMM estimation. The GMM estimator defined in Equation
(19.66) is not a feasible estimator because it depends on the unknown variance
matrix $H$. However, a feasible efficient GMM estimator can be computed by
substituting a consistent estimator of $H$ into the minimization problem of Equation
(19.65) or, equivalently, by substituting a consistent estimator of $H$ into the formula
for $\tilde{\beta}^{Eff.GMM}$ in Equation (19.66).

The efficient GMM estimator can be computed in two steps. In the first step,
estimate $\beta$ using any consistent estimator. Use this estimator of $\beta$ to compute the
residuals from the equation of interest, and then use these residuals to compute an
estimator of $H$. In the second step, use this estimator of $H$ to estimate the optimal
weight matrix $H^{-1}$ and to compute the efficient GMM estimator. To be concrete, in
the linear IV regression model, it is natural to use the TSLS estimator in the first step
and to use the TSLS residuals to estimate $H$. If TSLS is used in the first step, then the
feasible efficient GMM estimator computed in the second step is

$$\hat{\beta}^{Eff.GMM} = (X'Z\hat{H}^{-1}Z'X)^{-1}X'Z\hat{H}^{-1}Z'Y, \tag{19.68}$$

where $\hat{H}$ is given in Equation (19.54).
Because $\hat{H} \xrightarrow{p} H$, $\sqrt{n}(\hat{\beta}^{Eff.GMM} - \tilde{\beta}^{Eff.GMM}) \xrightarrow{p} 0$ (Exercise 19.12), and

$$\sqrt{n}(\hat{\beta}^{Eff.GMM} - \beta) \xrightarrow{d} N(0, \Sigma^{Eff.GMM}), \tag{19.69}$$

where $\Sigma^{Eff.GMM} = (Q_{XZ}H^{-1}Q_{ZX})^{-1}$ [Equation (19.67)]. That is, the feasible two-step
estimator $\hat{\beta}^{Eff.GMM}$ in Equation (19.68) is, asymptotically, the efficient GMM
estimator.
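
The two-step procedure can be sketched as follows. The helper `iv_weighted` implements Equation (19.59) for an arbitrary weight matrix, and `two_step_gmm` calls it twice; both names, and the simulated heteroskedastic data, are assumptions made for this illustration.

```python
import numpy as np

def iv_weighted(X, Z, Y, A):
    """IV estimator with weight matrix A, Equation (19.59)."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ A @ XZ.T, XZ @ A @ (Z.T @ Y))

def two_step_gmm(X, Z, Y):
    """Feasible efficient GMM with TSLS as the first step."""
    n = X.shape[0]
    # Step 1: TSLS, which is the IV estimator with A = (Z'Z)^{-1}.
    b_tsls = iv_weighted(X, Z, Y, np.linalg.inv(Z.T @ Z))
    u_hat = Y - X @ b_tsls
    # Estimate H from the TSLS residuals, Equation (19.54).
    Zu = Z * u_hat[:, None]
    H_hat = Zu.T @ Zu / n
    # Step 2: IV estimator with A = H_hat^{-1}, Equation (19.68).
    b_gmm = iv_weighted(X, Z, Y, np.linalg.inv(H_hat))
    return b_tsls, b_gmm, H_hat

# Simulated data (hypothetical): error variance depends on the first instrument.
rng = np.random.default_rng(5)
n = 1000
z = rng.normal(size=(n, 3))
v = rng.normal(size=n)
u = (0.8 * v + rng.normal(size=n)) * np.sqrt(0.5 + z[:, 0] ** 2)
x = z @ np.array([1.0, 0.5, 0.5]) + v
Y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
b_tsls, b_gmm, H_hat = two_step_gmm(X, Z, Y)
```

Both estimators are consistent in this design; the second step changes only the weighting of the moment conditions, which matters for efficiency when the errors are heteroskedastic.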


The heteroskedasticity-robust J-statistic. The heteroskedasticity-robust J-statistic,
also known as the GMM J-statistic, is the counterpart of the TSLS-based J-statistic,
computed using the efficient GMM estimator and weight function. That is, the GMM
J-statistic is given by

$$J^{GMM} = (Z'\hat{U}^{GMM})'\hat{H}^{-1}(Z'\hat{U}^{GMM})/n, \tag{19.70}$$

where $\hat{U}^{GMM} = Y - X\hat{\beta}^{Eff.GMM}$ are the residuals from the equation of interest,
estimated by (feasible) efficient GMM, and $\hat{H}^{-1}$ is the weight matrix used to compute
$\hat{\beta}^{Eff.GMM}$.

Under the null hypothesis $E(Z_i u_i) = 0$, $J^{GMM} \xrightarrow{d} \chi^2_{m-k}$ (see Appendix 19.6).
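
A compact sketch of Equation (19.70) is given below. The function name `gmm_j_statistic` and the simulated data are assumptions for illustration; the weight matrix is the inverse of the estimate of $H$ built from TSLS residuals, as in the two-step procedure described above.

```python
import numpy as np

def gmm_j_statistic(X, Z, Y):
    """Heteroskedasticity-robust (GMM) J-statistic, Equation (19.70)."""
    n = X.shape[0]
    XZ = X.T @ Z

    def iv(A):
        # IV estimator with weight matrix A, Equation (19.59).
        return np.linalg.solve(XZ @ A @ XZ.T, XZ @ A @ (Z.T @ Y))

    # First step: TSLS residuals and the implied estimate of H.
    u_tsls = Y - X @ iv(np.linalg.inv(Z.T @ Z))
    Zu = Z * u_tsls[:, None]
    H_inv = np.linalg.inv(Zu.T @ Zu / n)
    # Second step: efficient GMM residuals.
    u_gmm = Y - X @ iv(H_inv)
    g = Z.T @ u_gmm
    J = g @ H_inv @ g / n
    dof = Z.shape[1] - X.shape[1]   # m - k overidentifying restrictions
    return J, dof

# Simulated data (hypothetical): valid instruments, heteroskedastic errors.
rng = np.random.default_rng(6)
n = 1000
z = rng.normal(size=(n, 3))
v = rng.normal(size=n)
u = (0.8 * v + rng.normal(size=n)) * np.sqrt(0.5 + z[:, 0] ** 2)
x = z @ np.array([1.0, 0.5, 0.5]) + v
Y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
J_gmm, dof = gmm_j_statistic(X, Z, Y)
```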


GMM with time series data. The results in this section were derived under the IV 
regression assumptions for cross-sectional data. In many applications, however, these 
results extend to time series applications of IV regression and GMM. Although a 
formal mathematical treatment of GMM with time series data is beyond the scope 
of this book (for such a treatment, see Hayashi, 2000, Chapter 6), we nevertheless 
will summarize the key ideas of GMM estimation with time series data. This sum- 
mary assumes familiarity with the material in Chapters 14 and 16. For this discussion, 
it is assumed that the variables are stationary.

It is useful to distinguish between two types of applications: applications in which the
error term $u_t$ is serially correlated and applications in which $u_t$ is serially uncorrelated.
If the error term $u_t$ is serially correlated, then the asymptotic distribution of the GMM
estimator continues to be normally distributed, but the formula for $H$ in Equation (19.50)
is no longer correct. Instead, the correct expression for $H$ depends on the
autocovariances of $Z_t u_t$ and is analogous to the formula given in Equation (16.14) for the
variance of the OLS estimator when the error term is serially correlated. The efficient GMM
estimator is still constructed using a consistent estimator of $H$; however, that consistent
estimator must be computed using the HAC methods discussed in Chapter 16.

If $Z_t u_t$ is not serially correlated, then HAC estimation of $H$ is unnecessary, and the
formulas presented in this section all extend to time series GMM applications. In
modern applications to finance and macroeconometrics, it is common to encounter
models in which the error term represents an unexpected or unforecastable
disturbance, in which case the model typically implies that $Z_t u_t$ is serially uncorrelated. For
example, consider a model with a single included endogenous variable and no included
exogenous variables so that the equation of interest is $Y_t = \beta_0 + \beta_1 X_t + u_t$. Suppose
that an economic theory implies that $u_t$ is unpredictable given past information. Then
the theory implies the moment condition

$$E(u_t | Y_{t-1}, X_{t-1}, Z_{t-1}, Y_{t-2}, X_{t-2}, Z_{t-2}, \ldots) = 0, \tag{19.71}$$

where $Z_{t-1}$ is the lagged value of some other variable. The moment condition in
Equation (19.71) implies that all the lagged variables $Y_{t-1}, X_{t-1}, Z_{t-1}, Y_{t-2}, X_{t-2},
Z_{t-2}, \ldots$ are candidates for being valid instruments (they satisfy the exogeneity
condition). Moreover, because $u_{t-1} = Y_{t-1} - \beta_0 - \beta_1 X_{t-1}$, the moment condition in
Equation (19.71) is equivalent to
$E(u_t | u_{t-1}, X_{t-1}, Z_{t-1}, u_{t-2}, X_{t-2}, Z_{t-2}, \ldots) = 0$.
Because $u_t$ is serially uncorrelated, HAC estimation of $H$ is unnecessary. The theory
of GMM presented in this section, including efficient GMM estimation and the
GMM J-statistic, therefore applies directly to time series applications with moment
conditions of the form in Equation (19.71), under the hypothesis that the moment
condition in Equation (19.71) is, in fact, correct.


Summary 


1. The linear multiple regression model in matrix form is $Y = X\beta + U$, where
   $Y$ is the $n \times 1$ vector of observations on the dependent variable, $X$ is the
   $n \times (k + 1)$ matrix of $n$ observations on the $k + 1$ regressors (including a
   constant), $\beta$ is the $k + 1$ vector of unknown parameters, and $U$ is the $n \times 1$
   vector of error terms.

2. The OLS estimator is $\hat{\beta} = (X'X)^{-1}X'Y$. Under the first four least squares
   assumptions in Key Concept 19.1, $\hat{\beta}$ is consistent and asymptotically normally
   distributed. If in addition the errors are homoskedastic, then the conditional
   variance of $\hat{\beta}$ is $\mathrm{var}(\hat{\beta} | X) = \sigma_u^2(X'X)^{-1}$.

3. General linear restrictions on $\beta$ can be written as the $q$ equations $R\beta = r$, and
   this formulation can be used to test joint hypotheses involving multiple
   coefficients or to construct confidence sets for elements of $\beta$.

4. When the regression errors are i.i.d. and normally distributed, conditional on
   $X$, $\hat{\beta}$ has an exact normal distribution, and the homoskedasticity-only t- and
   F-statistics have exact $t_{n-k-1}$ and $F_{q,n-k-1}$ distributions, respectively.

5. The Gauss-Markov theorem says that, if the errors are homoskedastic and
   conditionally uncorrelated across observations and if $E(u_i | X) = 0$, the OLS estimator is
   efficient among linear conditionally unbiased estimators (that is, OLS is BLUE).

6. If the error covariance matrix $\Omega$ is not proportional to the identity matrix and
   if $\Omega$ is known or can be estimated, then the GLS estimator is asymptotically
   more efficient than OLS. However, GLS requires that, in general, $u_i$ be
   uncorrelated with all observations on the regressors, not just with $X_i$, as is required
   by OLS, an assumption that must be evaluated carefully in applications.

7. The TSLS estimator is a member of the class of GMM estimators of the linear
   model. In GMM, the coefficients are estimated by making the sample
   covariance between the regression error and the exogenous variables as small as
   possible, specifically, by solving $\min_b \, [(Y - Xb)'Z]A[Z'(Y - Xb)]$, where $A$
   is a nonrandom positive definite matrix. The asymptotically efficient GMM
   estimator sets $A = [E(Z_i Z_i' u_i^2)]^{-1}$. When the errors are homoskedastic, the
   asymptotically efficient GMM estimator in the linear IV regression model is TSLS.


Key Terms


Review the Concepts 


19.1 A researcher studying the relationship between earnings and workers' sex specifies
     the regression model $Y_i = \beta_0 + X_{1i}\beta_1 + X_{2i}\beta_2 + u_i$, where $X_{1i}$ is a binary
     variable that equals 1 if the $i$th person is a female and $X_{2i}$ is a binary variable
     that equals 1 if the $i$th person is a male. Write the model in the matrix form of
     Equation (19.2) for a hypothetical set of $n = 5$ observations. Show that the
     columns of $X$ are linearly dependent, so that $X$ does not have full rank. Explain how
     you would respecify the model to eliminate the perfect multicollinearity.

19.2 You are analyzing a linear regression model with 500 observations and one
     regressor. Explain how you would construct a confidence interval for $\beta_1$ if

     a. Assumptions 1 through 4 in Key Concept 19.1 are true but you think
        assumption 5 or 6 might not be true.

     b. Assumptions 1 through 5 are true but you think assumption 6 might not
        be true. (Give two ways to construct the confidence interval.)

     c. Assumptions 1 through 6 are true.

19.3 Suppose that assumptions 1 through 5 in Key Concept 19.1 are true but that
     assumption 6 is not. Does the result in Equation (19.31) hold? Explain.

19.4 When is the GLS estimator more efficient than the OLS estimator within the
     class of linear conditionally unbiased estimators?

19.5 Construct an example of a regression model that satisfies the assumption
     $E(u_i | X_i) = 0$ but for which $E(U | X) \ne 0_n$.


Exercises

19.1 Consider the population regression of test scores against income and the
square of income in Equation (8.1).

a. Write the regression in Equation (8.1) in the matrix form of Equation
(19.5). Define $\mathbf{Y}$, $\mathbf{X}$, $\mathbf{U}$, and $\boldsymbol{\beta}$.

b. Explain how to test the null hypothesis that the relationship between test
scores and income is linear against the alternative that it is quadratic. Write
the null hypothesis in the form of Equation (19.20). What are $\mathbf{R}$, $\mathbf{r}$, and $q$?

19.2 Suppose that a sample of $n = 20$ households has the sample means and
sample covariances below for a dependent variable and two regressors:

                          Sample Covariances
      Sample Means      Y        X1       X2
Y         6.39         0.26     0.22     0.32
X1        7.24                  0.80     0.28
X2        4.00                           2.40

a. Calculate the OLS estimates of $\beta_0$, $\beta_1$, and $\beta_2$. Calculate $s_{\hat{u}}^2$. Calculate the
$R^2$ of the regression.

b. Suppose that all six assumptions in Key Concept 19.1 hold. Test the
hypothesis that $\beta_1 = 0$ at the 5% significance level.

19.3 Let $\mathbf{W}$ be an $m \times 1$ vector with covariance matrix $\boldsymbol{\Sigma}_W$, where $\boldsymbol{\Sigma}_W$ is finite and
positive definite. Let $\mathbf{c}$ be a nonrandom $m \times 1$ vector, and let $Q = \mathbf{c}'\mathbf{W}$.

a. Show that $\mathrm{var}(Q) = \mathbf{c}'\boldsymbol{\Sigma}_W\mathbf{c}$.

b. Suppose that $\mathbf{c} \neq \mathbf{0}_m$. Show that $0 < \mathrm{var}(Q) < \infty$.

19.4 Consider the regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$ from Chapter 4, and
assume that the least squares assumptions in Key Concept 4.3 hold.

a. Write the model in the matrix form given in Equations (19.2) and (19.3).

b. Show that assumptions 1 through 4 in Key Concept 19.1 are satisfied.

c. Use the general formula for $\hat{\boldsymbol{\beta}}$ in Equation (19.11) to derive the expressions
for $\hat{\beta}_0$ and $\hat{\beta}_1$ given in Key Concept 4.2.

d. Show that the (1, 1) element of $\boldsymbol{\Sigma}_{\hat{\beta}}$ in Equation (19.13) is equal to the
expression for $\sigma_{\hat{\beta}_0}^2$ given in Key Concept 4.4.

19.5 Let $\mathbf{P}_X$ and $\mathbf{M}_X$ be as defined in Equations (19.24) and (19.25).

a. Prove that $\mathbf{P}_X\mathbf{M}_X = \mathbf{0}_{n \times n}$ and that $\mathbf{P}_X$ and $\mathbf{M}_X$ are idempotent.

b. Derive Equations (19.27) and (19.28).

c. Show that $\mathrm{rank}(\mathbf{P}_X) = k + 1$ and $\mathrm{rank}(\mathbf{M}_X) = n - k - 1$. [Hint: First
solve Exercise 19.10, and then use the fact that $\mathrm{trace}(\mathbf{AB}) = \mathrm{trace}(\mathbf{BA})$
for conformable matrices $\mathbf{A}$ and $\mathbf{B}$.]

19.6 Consider the regression model in matrix form, $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{W}\boldsymbol{\gamma} + \mathbf{U}$, where
$\mathbf{X}$ is an $n \times k_1$ matrix of regressors and $\mathbf{W}$ is an $n \times k_2$ matrix of regressors.
Then, as shown in Exercise 19.17, the OLS estimator $\hat{\boldsymbol{\beta}}$ can be expressed

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{M}_W\mathbf{X})^{-1}(\mathbf{X}'\mathbf{M}_W\mathbf{Y}).$$

Now let $\hat{\beta}_1^{BV}$ be the "binary variable" fixed effects estimator computed by
estimating Equation (10.11) by OLS, and let $\hat{\beta}_1^{DM}$ be the "demeaning" fixed
effects estimator computed by estimating Equation (10.14) by OLS, in which
the entity-specific sample means have been subtracted from $X$ and $Y$. Use the
expression for $\hat{\boldsymbol{\beta}}$ given above to prove that $\hat{\beta}_1^{BV} = \hat{\beta}_1^{DM}$. [Hint: Write Equation
(10.11) using a full set of fixed effects, $D1_i, D2_i, \ldots, Dn_i$, and no constant term.
Include all of the fixed effects in $\mathbf{W}$. Write out the matrix $\mathbf{M}_W\mathbf{X}$.]

19.7 Consider the regression model $Y_i = \beta_1 X_i + \beta_2 W_i + u_i$, where for simplicity
the intercept is omitted and all variables are assumed to have a mean of 0.
Suppose that $X_i$ is distributed independently of $(W_i, u_i)$ but $W_i$ and $u_i$ might
be correlated, and let $\hat{\beta}_1$ and $\hat{\beta}_2$ be the OLS estimators for this model.

a. Show that whether or not $W_i$ and $u_i$ are correlated, $\hat{\beta}_1 \xrightarrow{p} \beta_1$.

b. Show that if $W_i$ and $u_i$ are correlated, then $\hat{\beta}_2$ is inconsistent.

c. Let $\hat{\beta}_1^{r}$ be the OLS estimator from the regression of $Y$ on $X$ (the restricted
regression that excludes $W$). Will $\hat{\beta}_1^{r}$ have a smaller asymptotic variance
than $\hat{\beta}_1$, allowing for the possibility that $W_i$ and $u_i$ are correlated? Explain.

19.8 Consider the regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$, where $u_1 = \tilde{u}_1$ and
$u_i = 0.5u_{i-1} + \tilde{u}_i$ for $i = 2, 3, \ldots, n$. Suppose that $\tilde{u}_i$ are i.i.d. with mean 0
and variance 1 and are distributed independently of $X_j$ for all $i$ and $j$.

a. Derive an expression for $E(\mathbf{U}\mathbf{U}') = \boldsymbol{\Omega}$.

b. Explain how to estimate the model by GLS without explicitly inverting
the matrix $\boldsymbol{\Omega}$. (Hint: Transform the model so that the regression errors
are $\tilde{u}_1, \tilde{u}_2, \ldots, \tilde{u}_n$.)

19.9 This exercise shows that the OLS estimator of a subset of the regression
coefficients is consistent under the conditional mean independence assumption
stated in Key Concept 6.6. Consider the multiple regression model in
matrix form $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{W}\boldsymbol{\gamma} + \mathbf{U}$, where $\mathbf{X}$ and $\mathbf{W}$ are, respectively, $n \times k_1$
and $n \times k_2$ matrices of regressors. Let $\mathbf{X}_i'$ and $\mathbf{W}_i'$ denote the $i$th rows of $\mathbf{X}$
and $\mathbf{W}$ [as in Equation (19.4)]. Assume that (i) $E(u_i \mid \mathbf{X}_i, \mathbf{W}_i) = \mathbf{W}_i'\boldsymbol{\delta}$, where $\boldsymbol{\delta}$
is a $k_2 \times 1$ vector of unknown parameters; (ii) $(\mathbf{X}_i, \mathbf{W}_i, Y_i)$ are i.i.d.; (iii) $(\mathbf{X}_i, \mathbf{W}_i, u_i)$
have four finite nonzero moments; and (iv) there is no perfect multicollinearity.
These are assumptions 1 through 4 of Key Concept 19.1, with the conditional mean
independence assumption (i) replacing the usual conditional mean 0 assumption.

a. Use the expression for $\hat{\boldsymbol{\beta}}$ given in Exercise 19.6 to write
$\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (n^{-1}\mathbf{X}'\mathbf{M}_W\mathbf{X})^{-1}(n^{-1}\mathbf{X}'\mathbf{M}_W\mathbf{U})$.

b. Show that $n^{-1}\mathbf{X}'\mathbf{M}_W\mathbf{X} \xrightarrow{p} \boldsymbol{\Sigma}_{XX} - \boldsymbol{\Sigma}_{XW}\boldsymbol{\Sigma}_{WW}^{-1}\boldsymbol{\Sigma}_{WX}$, where
$\boldsymbol{\Sigma}_{XX} = E(\mathbf{X}_i\mathbf{X}_i')$, $\boldsymbol{\Sigma}_{XW} = E(\mathbf{X}_i\mathbf{W}_i')$, and so forth. [The matrix $\mathbf{A}_n \xrightarrow{p} \mathbf{A}$ if
$A_{n,ij} \xrightarrow{p} A_{ij}$ for all $i, j$ pairs, where $A_{n,ij}$ and $A_{ij}$ are the $(i, j)$ elements of
$\mathbf{A}_n$ and $\mathbf{A}$.]

c. Show that assumptions (i) and (ii) imply that $E(\mathbf{U} \mid \mathbf{X}, \mathbf{W}) = \mathbf{W}\boldsymbol{\delta}$.

d. Use (c) and the law of iterated expectations to show that
$n^{-1}\mathbf{X}'\mathbf{M}_W\mathbf{U} \xrightarrow{p} \mathbf{0}_{k_1 \times 1}$.

e. Use (a) through (d) to conclude that, under assumptions (i) through (iv),
$\hat{\boldsymbol{\beta}} \xrightarrow{p} \boldsymbol{\beta}$.
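The partitioned-regression expression used in Exercises 19.6 and 19.9 can be checked numerically. The sketch below is only an illustration: the simulated data, the coefficient values, and all variable names are assumptions made for the example, not part of the exercises. It compares the coefficients on $\mathbf{X}$ from the full OLS regression of $\mathbf{Y}$ on $[\mathbf{X}\ \mathbf{W}]$ with $(\mathbf{X}'\mathbf{M}_W\mathbf{X})^{-1}\mathbf{X}'\mathbf{M}_W\mathbf{Y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k1, k2 = 200, 2, 3

# Simulated regressors and outcome (illustrative values, not from the text).
X = rng.normal(size=(n, k1))
W = rng.normal(size=(n, k2)) + 0.5 * X[:, [0]]   # make W correlated with X
Y = X @ np.array([1.0, -2.0]) + W @ np.array([0.5, 0.0, 1.5]) + rng.normal(size=n)

# Full OLS of Y on [X, W]; the first k1 coefficients are the estimate of beta.
XW = np.hstack([X, W])
coef_full = np.linalg.solve(XW.T @ XW, XW.T @ Y)
beta_full = coef_full[:k1]

# Partitioned formula with M_W = I - W (W'W)^{-1} W'.
M_W = np.eye(n) - W @ np.linalg.solve(W.T @ W, W.T)
beta_part = np.linalg.solve(X.T @ M_W @ X, X.T @ M_W @ Y)

# Largest absolute difference between the two sets of estimates.
max_gap = float(np.max(np.abs(beta_full - beta_part)))
print(max_gap)
```

The two estimates should agree up to floating-point rounding; forming the $n \times n$ matrix $\mathbf{M}_W$ explicitly is done here for transparency and would be avoided for large $n$.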


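19.10 Let $\mathbf{C}$ be a symmetric idempotent matrix.

a. Show that the eigenvalues of $\mathbf{C}$ are either 0 or 1. (Hint: Note that $\mathbf{C}\mathbf{q} = \gamma\mathbf{q}$
implies $\mathbf{0} = \mathbf{C}\mathbf{q} - \gamma\mathbf{q} = \mathbf{C}\mathbf{C}\mathbf{q} - \gamma\mathbf{q} = \gamma\mathbf{C}\mathbf{q} - \gamma\mathbf{q} = \gamma^2\mathbf{q} - \gamma\mathbf{q}$, and solve
for $\gamma$.)

b. Show that $\mathrm{trace}(\mathbf{C}) = \mathrm{rank}(\mathbf{C})$.

c. Let $\mathbf{d}$ be an $n \times 1$ vector. Show that $\mathbf{d}'\mathbf{C}\mathbf{d} \geq 0$.

19.11 Suppose that $\mathbf{C}$ is an $n \times n$ symmetric idempotent matrix with rank $r$, and let
$\mathbf{V} \sim N(\mathbf{0}_n, \mathbf{I}_n)$.

a. Show that $\mathbf{C} = \mathbf{A}\mathbf{A}'$, where $\mathbf{A}$ is $n \times r$ with $\mathbf{A}'\mathbf{A} = \mathbf{I}_r$. (Hint: $\mathbf{C}$ is positive
semidefinite and can be written as $\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}'$, as explained in Appendix 19.1.)

b. Show that $\mathbf{A}'\mathbf{V} \sim N(\mathbf{0}_r, \mathbf{I}_r)$.

c. Show that $\mathbf{V}'\mathbf{C}\mathbf{V} \sim \chi_r^2$.

19.12 a. Show that $\tilde{\boldsymbol{\beta}}^{Eff.GMM}$ is the efficient GMM estimator; that is, show that $\tilde{\boldsymbol{\beta}}^{Eff.GMM}$
in Equation (19.66) is the solution to Equation (19.65).

b. Show that $\sqrt{n}(\hat{\boldsymbol{\beta}}^{Eff.GMM} - \tilde{\boldsymbol{\beta}}^{Eff.GMM}) \xrightarrow{p} \mathbf{0}$.

c. Show that $J^{GMM} \xrightarrow{d} \chi_{m-k}^2$.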


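19.13 Consider the problem of minimizing the sum of squared residuals, subject to
the constraint that $\mathbf{R}\mathbf{b} = \mathbf{r}$, where $\mathbf{R}$ is $q \times (k + 1)$ with rank $q$. Let $\tilde{\boldsymbol{\beta}}$ be the
value of $\mathbf{b}$ that solves the constrained minimization problem.

a. Show that the Lagrangian for the minimization problem is
$L(\mathbf{b}, \boldsymbol{\gamma}) = (\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{Y} - \mathbf{X}\mathbf{b}) + \boldsymbol{\gamma}'(\mathbf{R}\mathbf{b} - \mathbf{r})$, where $\boldsymbol{\gamma}$ is a $q \times 1$
vector of Lagrange multipliers.

b. Show that $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})$.

c. Show that $(\mathbf{Y} - \mathbf{X}\tilde{\boldsymbol{\beta}})'(\mathbf{Y} - \mathbf{X}\tilde{\boldsymbol{\beta}}) - (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) =
(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})$.

d. Show that $\tilde{F}$ in Equation (19.36) is equivalent to the homoskedasticity-only
F-statistic in Equation (7.13).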
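A numerical check can make the restricted least squares formulas in Exercise 19.13 concrete before they are proved. The sketch below is illustrative only: the simulated data and the particular restriction matrix `R` and vector `r` are hypothetical choices, not taken from the text. It verifies that the estimator in part (b) satisfies the constraint and that the increase in the sum of squared residuals equals the quadratic form in part (c).

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # n x (k+1), with intercept
Y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(size=n)

# Hypothetical restrictions R b = r with q = 2: beta_1 + beta_2 = 0 and beta_3 = 0.
R = np.array([[0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
r = np.array([0.0, 0.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y                       # unrestricted OLS
middle = np.linalg.inv(R @ XtX_inv @ R.T)          # [R (X'X)^{-1} R']^{-1}
beta_tilde = beta_hat - XtX_inv @ R.T @ middle @ (R @ beta_hat - r)

# Sums of squared residuals and the quadratic form from part (c).
ssr_unrestricted = float((Y - X @ beta_hat) @ (Y - X @ beta_hat))
ssr_restricted = float((Y - X @ beta_tilde) @ (Y - X @ beta_tilde))
quad_form = float((R @ beta_hat - r) @ middle @ (R @ beta_hat - r))

constraint_holds = bool(np.allclose(R @ beta_tilde, r))
ssr_gap_matches = bool(np.isclose(ssr_restricted - ssr_unrestricted, quad_form))
```

Dividing `quad_form` by $q$ and by $s_{\hat{u}}^2$ gives the homoskedasticity-only F-statistic, which is the link that part (d) asks for.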


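19.14 Consider the regression model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{U}$. Partition $\mathbf{X}$ as $[\mathbf{X}_1\ \mathbf{X}_2]$ and $\boldsymbol{\beta}$
as $[\boldsymbol{\beta}_1'\ \boldsymbol{\beta}_2']'$, where $\mathbf{X}_1$ has $k_1$ columns and $\mathbf{X}_2$ has $k_2$ columns. Suppose that
$\mathbf{X}_2'\mathbf{Y} = \mathbf{0}_{k_2 \times 1}$. Let $\mathbf{R} = [\mathbf{I}_{k_1}\ \mathbf{0}_{k_1 \times k_2}]$.

a. Show that $\hat{\boldsymbol{\beta}}'(\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}} = (\mathbf{R}\hat{\boldsymbol{\beta}})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}})$.

b. Consider the regression described in Equation (12.17).
Let $\mathbf{W} = [\mathbf{1}\ \mathbf{W}_1\ \mathbf{W}_2\ \cdots\ \mathbf{W}_r]$, where $\mathbf{1}$ is an $n \times 1$ vector of 1's, $\mathbf{W}_1$
is the $n \times 1$ vector with $i$th element $W_{1i}$, and so forth. Let $\hat{\mathbf{U}}^{TSLS}$ denote
the vector of two stage least squares residuals.

i. Show that $\mathbf{W}'\hat{\mathbf{U}}^{TSLS} = \mathbf{0}$.

ii. Show that the method for computing the J-statistic described in Key
Concept 12.6 (using a homoskedasticity-only F-statistic) and that using
the formula in Equation (19.63) produce the same value for the J-statistic.
[Hint: Use the results in (a), (b.i), and Exercise 19.13.]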


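19.15 (Consistency of clustered standard errors.) Consider the panel data model
$Y_{it} = \beta X_{it} + \alpha_i + u_{it}$, where all variables are scalars. Assume that assumptions
1, 2, and 4 in Key Concept 10.3 hold and strengthen assumption 3, so that $X_{it}$
and $u_{it}$ have eight nonzero finite moments. Let $\mathbf{M} = \mathbf{I}_T - T^{-1}\boldsymbol{\iota}\boldsymbol{\iota}'$, where $\boldsymbol{\iota}$ is a
$T \times 1$ vector of 1's. Also let $\mathbf{Y}_i = (Y_{i1}\ Y_{i2}\ \cdots\ Y_{iT})'$, $\mathbf{X}_i = (X_{i1}\ X_{i2}\ \cdots\ X_{iT})'$,
$\mathbf{u}_i = (u_{i1}\ u_{i2}\ \cdots\ u_{iT})'$, $\tilde{\mathbf{Y}}_i = \mathbf{M}\mathbf{Y}_i$, $\tilde{\mathbf{X}}_i = \mathbf{M}\mathbf{X}_i$, and $\tilde{\mathbf{u}}_i = \mathbf{M}\mathbf{u}_i$. For the
asymptotic calculations in this problem, suppose that $T$ is fixed and
$n \to \infty$.

a. Show that the fixed effects estimator of $\beta$ from Section 10.3 can be written
as $\hat{\beta} = \left(\sum_{i=1}^n \tilde{\mathbf{X}}_i'\tilde{\mathbf{X}}_i\right)^{-1}\sum_{i=1}^n \tilde{\mathbf{X}}_i'\tilde{\mathbf{Y}}_i$.

b. Show that $\hat{\beta} - \beta = \left(\sum_{i=1}^n \tilde{\mathbf{X}}_i'\tilde{\mathbf{X}}_i\right)^{-1}\sum_{i=1}^n \tilde{\mathbf{X}}_i'\mathbf{u}_i$. (Hint: $\mathbf{M}$ is idempotent.)

c. Let $Q_{\tilde{X}} = T^{-1}E(\tilde{\mathbf{X}}_i'\tilde{\mathbf{X}}_i)$ and $\hat{Q}_{\tilde{X}} = \frac{1}{nT}\sum_{i=1}^n\sum_{t=1}^T \tilde{X}_{it}^2$. Show that $\hat{Q}_{\tilde{X}} \xrightarrow{p} Q_{\tilde{X}}$.

d. Let $\eta_i = \tilde{\mathbf{X}}_i'\mathbf{u}_i/\sqrt{T}$ and $\sigma_\eta^2 = \mathrm{var}(\eta_i)$. Show that $\sqrt{\tfrac{1}{n}}\sum_{i=1}^n \eta_i \xrightarrow{d} N(0, \sigma_\eta^2)$.

e. Use your answers to (b) through (d) to prove Equation (10.25); that is,
show that $\sqrt{nT}(\hat{\beta} - \beta) \xrightarrow{d} N(0, \sigma_\eta^2/Q_{\tilde{X}}^2)$.

f. Let $\tilde{\sigma}_{\eta,clustered}^2$ be the infeasible clustered variance estimator,
computed using the true errors instead of the residuals so that
$\tilde{\sigma}_{\eta,clustered}^2 = \frac{1}{nT}\sum_{i=1}^n (\tilde{\mathbf{X}}_i'\mathbf{u}_i)^2$. Show that $\tilde{\sigma}_{\eta,clustered}^2 \xrightarrow{p} \sigma_\eta^2$.

g. Let $\hat{\tilde{\mathbf{u}}}_i = \tilde{\mathbf{Y}}_i - \hat{\beta}\tilde{\mathbf{X}}_i$ and $\hat{\sigma}_{\eta,clustered}^2 = \frac{n}{n-1}\frac{1}{nT}\sum_{i=1}^n (\tilde{\mathbf{X}}_i'\hat{\tilde{\mathbf{u}}}_i)^2$ [this is
Equation (10.27) in matrix form]. Show that $\hat{\sigma}_{\eta,clustered}^2 \xrightarrow{p} \sigma_\eta^2$.
[Hint: Use an argument like that used in Equation (18.16) to show that
$\hat{\sigma}_{\eta,clustered}^2 - \tilde{\sigma}_{\eta,clustered}^2 \xrightarrow{p} 0$, and then use your answer to (f).]

19.16 This exercise takes up the problem of missing data discussed in Section 9.2.
Consider the regression model $Y_i = X_i\beta + u_i$, $i = 1, \ldots, n$, where all variables
are scalars and the constant term/intercept is omitted for convenience.

a. Suppose that the least squares assumptions in Key Concept 4.3 are satisfied.
Show that the least squares estimator of $\beta$ is unbiased and consistent.

b. Now suppose that some of the observations are missing. Let $I_i$ denote a
binary random variable that indicates the nonmissing observations; that
is, $I_i = 1$ if observation $i$ is not missing, and $I_i = 0$ if observation $i$ is missing.
Assume that $\{I_i, X_i, u_i\}$ are i.i.d.

i. Show that the OLS estimator can be written as

$$\hat{\beta} = \left(\sum_{i=1}^n I_i X_i^2\right)^{-1}\left(\sum_{i=1}^n I_i X_i Y_i\right) = \beta + \left(\sum_{i=1}^n I_i X_i^2\right)^{-1}\left(\sum_{i=1}^n I_i X_i u_i\right).$$

ii. Suppose that data are missing "completely at random" in the sense that
$\Pr(I_i = 1 \mid X_i, u_i) = p$, where $p$ is a constant. Show that $\hat{\beta}$ is unbiased
and consistent.

iii. Suppose that the probability that the $i$th observation is missing depends
on $X_i$ but not on $u_i$; that is, $\Pr(I_i = 1 \mid X_i, u_i) = p(X_i)$. Show that $\hat{\beta}$ is
unbiased and consistent.

iv. Suppose that the probability that the $i$th observation is missing depends
on both $X_i$ and $u_i$; that is, $\Pr(I_i = 1 \mid X_i, u_i) = p(X_i, u_i)$. Is $\hat{\beta}$ unbiased?
Is $\hat{\beta}$ consistent? Explain.

c. Suppose that $\beta = 1$ and that $X_i$ and $u_i$ are mutually independent standard
normal random variables [so that both $X_i$ and $u_i$ are distributed $N(0, 1)$].
Suppose that $I_i = 1$ when $Y_i \geq 0$ but that $I_i = 0$ when $Y_i < 0$. Is $\hat{\beta}$
unbiased? Is $\hat{\beta}$ consistent? Explain.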


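19.17 Consider the regression model in matrix form $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{W}\boldsymbol{\gamma} + \mathbf{U}$, where $\mathbf{X}$
and $\mathbf{W}$ are matrices of regressors and $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ are vectors of unknown regression
coefficients. Let $\tilde{\mathbf{X}} = \mathbf{M}_W\mathbf{X}$ and $\tilde{\mathbf{Y}} = \mathbf{M}_W\mathbf{Y}$, where $\mathbf{M}_W = \mathbf{I} - \mathbf{W}(\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'$.

a. Show that the OLS estimators of $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ can be written as

$$\begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{W} \\ \mathbf{W}'\mathbf{X} & \mathbf{W}'\mathbf{W} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X}'\mathbf{Y} \\ \mathbf{W}'\mathbf{Y} \end{bmatrix}.$$

b. Show that

$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{W} \\ \mathbf{W}'\mathbf{X} & \mathbf{W}'\mathbf{W} \end{bmatrix}^{-1} = \begin{bmatrix} (\mathbf{X}'\mathbf{M}_W\mathbf{X})^{-1} & -(\mathbf{X}'\mathbf{M}_W\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}(\mathbf{W}'\mathbf{W})^{-1} \\ -(\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'\mathbf{X}(\mathbf{X}'\mathbf{M}_W\mathbf{X})^{-1} & (\mathbf{W}'\mathbf{W})^{-1} + (\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'\mathbf{X}(\mathbf{X}'\mathbf{M}_W\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}(\mathbf{W}'\mathbf{W})^{-1} \end{bmatrix}.$$

(Hint: Show that the product of the two matrices is equal to the identity
matrix.)

c. Show that $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{M}_W\mathbf{X})^{-1}\mathbf{X}'\mathbf{M}_W\mathbf{Y}$.

d. The Frisch-Waugh theorem (Appendix 6.2) says that $\hat{\boldsymbol{\beta}} = (\tilde{\mathbf{X}}'\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}'\tilde{\mathbf{Y}}$.
Use the result in (c) to prove the Frisch-Waugh theorem.

19.18 Consider the homoskedastic linear regression model with two regressors, and
let $\rho_{X_1,X_2} = \mathrm{corr}(X_1, X_2)$. Show that $\mathrm{corr}(\hat{\beta}_1, \hat{\beta}_2) \to -\rho_{X_1,X_2}$ [Equation (6.21)]
as $n$ increases.


APPENDIX 19.1

Summary of Matrix Algebra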


This appendix summarizes vectors, matrices, and the elements of matrix algebra used in Chapter 19.
The purpose of this appendix is to review some concepts and definitions from a course in
linear algebra, not to replace such a course.


Definitions of Vectors and Matrices

A vector is a collection of $n$ numbers or elements, collected either in a column (a column
vector) or in a row (a row vector). The $n$-dimensional column vector $\mathbf{b}$ and the $n$-dimensional
row vector $\mathbf{c}$ are

$$\mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \quad \text{and} \quad \mathbf{c} = [c_1\ c_2\ \cdots\ c_n],$$

where $b_1$ is the first element of $\mathbf{b}$ and, in general, $b_i$ is the $i$th element of $\mathbf{b}$.
Throughout, a boldface symbol denotes a vector or matrix.

A matrix is a collection, or an array, of numbers or elements, in which the elements are
laid out in columns and rows. The dimension of a matrix is $n \times m$, where $n$ is the number of
rows and $m$ is the number of columns. The $n \times m$ matrix $\mathbf{A}$ is

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{bmatrix},$$

where $a_{ij}$ is the $(i, j)$ element of $\mathbf{A}$; that is, $a_{ij}$ is the element that appears in the $i$th row and $j$th
column. An $n \times m$ matrix consists of $n$ row vectors or, alternatively, of $m$ column vectors.

To distinguish one-dimensional numbers from vectors and matrices, a one-dimensional
number is called a scalar.


Types of Matrices

Square, symmetric, and diagonal matrices. A matrix is said to be square if the number of
rows equals the number of columns. A square matrix is said to be symmetric if its $(i, j)$ element
equals its $(j, i)$ element. A diagonal matrix is a square matrix in which all the off-diagonal
elements equal 0; that is, if the square matrix $\mathbf{A}$ is diagonal, then $a_{ij} = 0$ for $i \neq j$.

Special matrices. An important matrix is the identity matrix, $\mathbf{I}_n$, which is an $n \times n$ diagonal
matrix with 1's on the diagonal. The null matrix, $\mathbf{0}_{n \times m}$, is the $n \times m$ matrix with all elements
equal to 0.

The transpose. The transpose of a matrix switches the rows and the columns. That is, the
transpose of a matrix turns the $n \times m$ matrix $\mathbf{A}$ into the $m \times n$ matrix, which is denoted by
$\mathbf{A}'$, where the $(i, j)$ element of $\mathbf{A}$ becomes the $(j, i)$ element of $\mathbf{A}'$; said differently, the transpose
of the matrix $\mathbf{A}$ turns the rows of $\mathbf{A}$ into the columns of $\mathbf{A}'$. If $a_{ij}$ is the $(i, j)$ element of $\mathbf{A}$, then
$\mathbf{A}'$ (the transpose of $\mathbf{A}$) is

$$\mathbf{A}' = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{n1} \\ a_{12} & a_{22} & \cdots & a_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1m} & a_{2m} & \cdots & a_{nm} \end{bmatrix}.$$

The transpose of a vector is a special case of the transpose of a matrix. Thus the transpose of
a vector turns a column vector into a row vector; that is, if $\mathbf{b}$ is an $n \times 1$ column vector, then
its transpose is the $1 \times n$ row vector

$$\mathbf{b}' = [b_1\ b_2\ \cdots\ b_n].$$

The transpose of a row vector is a column vector.


Elements of Matrix Algebra: Addition and Multiplication


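Matrix addition. Two matrices $\mathbf{A}$ and $\mathbf{B}$ that have the same dimensions (for example, that are
both $n \times m$) can be added together. The sum of two matrices is the sum of their elements; that
is, if $\mathbf{C} = \mathbf{A} + \mathbf{B}$, then $c_{ij} = a_{ij} + b_{ij}$. A special case of matrix addition is vector addition: If $\mathbf{a}$ and
$\mathbf{b}$ are both $n \times 1$ column vectors, then their sum, $\mathbf{c} = \mathbf{a} + \mathbf{b}$, is the element-wise sum; that is,
$c_i = a_i + b_i$.

Vector and matrix multiplication. Let $\mathbf{a}$ and $\mathbf{b}$ be two $n \times 1$ column vectors. Then the product
of the transpose of $\mathbf{a}$ (which is itself a row vector) and $\mathbf{b}$ is $\mathbf{a}'\mathbf{b} = \sum_{i=1}^n a_i b_i$. Applying this
definition with $\mathbf{b} = \mathbf{a}$ yields $\mathbf{a}'\mathbf{a} = \sum_{i=1}^n a_i^2$.

Similarly, the matrices $\mathbf{A}$ and $\mathbf{B}$ can be multiplied together if they are conformable, that
is, if the number of columns of $\mathbf{A}$ equals the number of rows of $\mathbf{B}$. Specifically, suppose that $\mathbf{A}$
has dimension $n \times m$ and $\mathbf{B}$ has dimension $m \times r$. Then the product of $\mathbf{A}$ and $\mathbf{B}$ is an $n \times r$
matrix, $\mathbf{C}$; that is, $\mathbf{C} = \mathbf{A}\mathbf{B}$, where the $(i, j)$ element of $\mathbf{C}$ is $c_{ij} = \sum_{k=1}^m a_{ik}b_{kj}$. Said differently,
the $(i, j)$ element of $\mathbf{A}\mathbf{B}$ is the product of multiplying the row vector that is the $i$th row of $\mathbf{A}$ by
the column vector that is the $j$th column of $\mathbf{B}$.

The product of a scalar $d$ with the matrix $\mathbf{A}$ has the $(i, j)$ element $da_{ij}$; that is, each element
of $\mathbf{A}$ is multiplied by the scalar $d$.

Some useful properties of matrix addition and multiplication. Let $\mathbf{A}$ and $\mathbf{B}$ be matrices. Then

a. $\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$;

b. $(\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C})$;

c. $(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}'$;

d. If $\mathbf{A}$ is $n \times m$, then $\mathbf{A}\mathbf{I}_m = \mathbf{A}$ and $\mathbf{I}_n\mathbf{A} = \mathbf{A}$;

e. $\mathbf{A}(\mathbf{B}\mathbf{C}) = (\mathbf{A}\mathbf{B})\mathbf{C}$;

f. $(\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{A}\mathbf{C} + \mathbf{B}\mathbf{C}$; and

g. $(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$.

In general, matrix multiplication does not commute; that is, in general $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$,
although there are some special cases in which matrix multiplication commutes; for example,
if $\mathbf{A}$ and $\mathbf{B}$ are both $n \times n$ diagonal matrices, then $\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A}$.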
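Two of these properties are easy to see on a small case. The following sketch uses arbitrary 2 × 2 matrices chosen only for illustration (they are not from the text) to check property (g), to show a pair of matrices that does not commute, and to show that diagonal matrices do.

```python
import numpy as np

# Arbitrary illustrative matrices.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 1.0]])
D1 = np.diag([2.0, 3.0])
D2 = np.diag([5.0, 7.0])

# Property (g): the transpose of a product reverses the order of the factors.
transpose_rule_holds = bool(np.array_equal((A @ B).T, B.T @ A.T))

# In general AB and BA differ ...
general_commutes = bool(np.array_equal(A @ B, B @ A))

# ... but two diagonal matrices of the same dimension commute.
diagonal_commutes = bool(np.array_equal(D1 @ D2, D2 @ D1))

print(transpose_rule_holds, general_commutes, diagonal_commutes)
```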


Matrix Inverse, Matrix Square Roots, and Related Topics

The matrix inverse. Let $\mathbf{A}$ be a square matrix. Assuming that it exists, the inverse of the
matrix $\mathbf{A}$ is defined as the matrix for which $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n$. If, in fact, the inverse matrix $\mathbf{A}^{-1}$
exists, then $\mathbf{A}$ is said to be invertible or nonsingular. If both $\mathbf{A}$ and $\mathbf{B}$ are invertible, then
$(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$.

Positive definite and positive semidefinite matrices. Let $\mathbf{V}$ be an $n \times n$ square matrix.
Then $\mathbf{V}$ is positive definite if $\mathbf{c}'\mathbf{V}\mathbf{c} > 0$ for all nonzero $n \times 1$ vectors $\mathbf{c}$. Similarly, $\mathbf{V}$ is positive
semidefinite if $\mathbf{c}'\mathbf{V}\mathbf{c} \geq 0$ for all nonzero $n \times 1$ vectors $\mathbf{c}$. If $\mathbf{V}$ is positive definite, then it is
invertible.

Linear independence. The $n \times 1$ vectors $\mathbf{a}_1$ and $\mathbf{a}_2$ are linearly independent if there do not
exist scalars $c_1$ and $c_2$, not both zero, such that $c_1\mathbf{a}_1 + c_2\mathbf{a}_2 = \mathbf{0}_{n \times 1}$. More generally, the set of $k$ vectors
$\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$ is linearly independent if there do not exist scalars $c_1, c_2, \ldots, c_k$, not all zero, such
that $c_1\mathbf{a}_1 + c_2\mathbf{a}_2 + \cdots + c_k\mathbf{a}_k = \mathbf{0}_{n \times 1}$.

The rank of a matrix. The rank of the $n \times m$ matrix $\mathbf{A}$ is the number of linearly independent
columns of $\mathbf{A}$. The rank of $\mathbf{A}$ is denoted $\mathrm{rank}(\mathbf{A})$. If the rank of $\mathbf{A}$ equals the number of
columns of $\mathbf{A}$, then $\mathbf{A}$ is said to have full column rank. If the $n \times m$ matrix $\mathbf{A}$ has full column rank,
then there does not exist a nonzero $m \times 1$ vector $\mathbf{c}$ such that $\mathbf{A}\mathbf{c} = \mathbf{0}_{n \times 1}$. If $\mathbf{A}$ is $n \times n$ with
$\mathrm{rank}(\mathbf{A}) = n$, then $\mathbf{A}$ is nonsingular. If the $n \times m$ matrix $\mathbf{A}$ has full column rank, then $\mathbf{A}'\mathbf{A}$ is
nonsingular.
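The link between full column rank and the invertibility of $\mathbf{A}'\mathbf{A}$ is what rules out perfect multicollinearity in the regression setting. As a small sketch (the two matrices are made up for illustration), compare a $3 \times 2$ matrix with linearly independent columns to one whose second column is twice the first.

```python
import numpy as np

# Illustrative 3 x 2 matrices.
A_full = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])      # columns are linearly independent
A_deficient = np.array([[1.0, 2.0],
                        [1.0, 2.0],
                        [1.0, 2.0]])  # second column equals 2 times the first

rank_full = int(np.linalg.matrix_rank(A_full))
rank_deficient = int(np.linalg.matrix_rank(A_deficient))

# A'A is nonsingular (nonzero determinant) only in the full column rank case.
det_full = float(np.linalg.det(A_full.T @ A_full))
det_deficient = float(np.linalg.det(A_deficient.T @ A_deficient))

print(rank_full, rank_deficient, det_full, det_deficient)
```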


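The trace of a matrix. The trace of the $n \times n$ (square) matrix $\mathbf{A}$ is the sum of the diagonal
elements; that is, $\mathrm{trace}(\mathbf{A}) = \sum_{i=1}^n a_{ii}$. For $n \times n$ matrices $\mathbf{A}$ and $\mathbf{B}$ and $n \times 1$ vector $\mathbf{c}$, the trace
satisfies these properties: $\mathrm{trace}(\mathbf{A}) = \mathrm{trace}(\mathbf{A}')$, $\mathrm{trace}(\mathbf{A} + \mathbf{B}) = \mathrm{trace}(\mathbf{A}) + \mathrm{trace}(\mathbf{B})$,
$\mathrm{trace}(\mathbf{A}\mathbf{B}) = \mathrm{trace}(\mathbf{B}\mathbf{A})$, $\mathrm{trace}(\mathbf{B}\mathbf{A}\mathbf{B}^{-1}) = \mathrm{trace}(\mathbf{A})$, and $\mathbf{c}'\mathbf{B}\mathbf{c} = \mathrm{trace}(\mathbf{B}\mathbf{c}\mathbf{c}')$.

The matrix square root. Let $\mathbf{V}$ be an $n \times n$ square symmetric positive definite matrix. The
matrix square root of $\mathbf{V}$ is defined to be an $n \times n$ matrix $\mathbf{F}$ such that $\mathbf{F}'\mathbf{F} = \mathbf{V}$. The matrix
square root of a positive definite matrix will always exist, but it is not unique. The matrix
square root has the property that $\mathbf{F}\mathbf{V}^{-1}\mathbf{F}' = \mathbf{I}_n$. In addition, the matrix square root of a positive
definite matrix is invertible, so $\mathbf{F}'^{-1}\mathbf{V}\mathbf{F}^{-1} = \mathbf{I}_n$.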
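One convenient way to obtain a matrix square root in this sense is the Cholesky factorization; this is a choice made for the sketch below, not a construction given in the text. NumPy's `cholesky` returns a lower triangular `L` with $\mathbf{L}\mathbf{L}' = \mathbf{V}$, so $\mathbf{F} = \mathbf{L}'$ satisfies $\mathbf{F}'\mathbf{F} = \mathbf{V}$. Premultiplying $\mathbf{F}$ by any orthogonal matrix gives a different square root, which illustrates the lack of uniqueness. The matrix `V` is an arbitrary example.

```python
import numpy as np

V = np.array([[4.0, 2.0],
              [2.0, 3.0]])            # symmetric positive definite (illustrative)

L = np.linalg.cholesky(V)             # lower triangular, V = L L'
F = L.T                               # matrix square root in the sense F'F = V

root_property = bool(np.allclose(F.T @ F, V))
whitening_1 = bool(np.allclose(F @ np.linalg.inv(V) @ F.T, np.eye(2)))
whitening_2 = bool(np.allclose(np.linalg.inv(F.T) @ V @ np.linalg.inv(F), np.eye(2)))

# Non-uniqueness: G is orthogonal (G'G = I), so F2 = G F is also a square root.
G = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
F2 = G @ F
second_root_property = bool(np.allclose(F2.T @ F2, V))
roots_differ = not np.allclose(F, F2)
```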


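Eigenvalues and eigenvectors. Let $\mathbf{A}$ be an $n \times n$ matrix. If the $n \times 1$ vector $\mathbf{q}$ and the scalar
$\lambda$ satisfy $\mathbf{A}\mathbf{q} = \lambda\mathbf{q}$, where $\mathbf{q}'\mathbf{q} = 1$, then $\lambda$ is an eigenvalue of $\mathbf{A}$, and $\mathbf{q}$ is the eigenvector of $\mathbf{A}$
associated with that eigenvalue. An $n \times n$ matrix has $n$ eigenvalues, which need not take on
distinct values, and $n$ eigenvectors.

If $\mathbf{V}$ is an $n \times n$ symmetric positive definite matrix, then the eigenvalues of $\mathbf{V}$ are positive
real numbers, and the eigenvectors of $\mathbf{V}$ are real. Also, $\mathbf{V}$ can be written in terms of its
eigenvalues and eigenvectors as $\mathbf{V} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}'$, where $\boldsymbol{\Lambda}$ is a diagonal $n \times n$ matrix with diagonal
elements that equal the eigenvalues of $\mathbf{V}$ and $\mathbf{Q}$ is an $n \times n$ matrix consisting of the
eigenvectors of $\mathbf{V}$, arranged so that the $i$th column of $\mathbf{Q}$ is the eigenvector corresponding to the
eigenvalue $\lambda_i$, which is the $i$th diagonal element of $\boldsymbol{\Lambda}$. The eigenvectors are orthonormal, so
$\mathbf{Q}'\mathbf{Q} = \mathbf{I}_n$. The trace of $\mathbf{V}$ equals the sum of its eigenvalues: $\mathrm{trace}(\mathbf{V}) = \mathrm{trace}(\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}') =
\mathrm{trace}(\boldsymbol{\Lambda}\mathbf{Q}'\mathbf{Q}) = \mathrm{trace}(\boldsymbol{\Lambda}) = \sum_{i=1}^n \lambda_i$.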
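The decomposition $\mathbf{V} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}'$ can be reproduced with a symmetric eigenvalue routine. In this sketch the matrix `V` is an arbitrary symmetric positive definite example, and `numpy.linalg.eigh` is used because it is designed for symmetric matrices and returns orthonormal eigenvectors as the columns of `Q`.

```python
import numpy as np

V = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])      # symmetric positive definite (illustrative)

lam, Q = np.linalg.eigh(V)           # eigenvalues (ascending) and eigenvectors
Lambda = np.diag(lam)

all_positive = bool(np.all(lam > 0))                       # positive definite
reconstructs_V = bool(np.allclose(Q @ Lambda @ Q.T, V))    # V = Q Lambda Q'
orthonormal = bool(np.allclose(Q.T @ Q, np.eye(3)))        # Q'Q = I
trace_equals_sum = bool(np.isclose(np.trace(V), lam.sum()))
```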


Idempotent matrices. A matrix $\mathbf{C}$ is idempotent if $\mathbf{C}$ is square and $\mathbf{C}\mathbf{C} = \mathbf{C}$. If $\mathbf{C}$ is an $n \times n$
idempotent matrix that is also symmetric, then $\mathbf{C}$ is positive semidefinite, and $\mathbf{C}$ has $r$
eigenvalues that equal 1 and $n - r$ eigenvalues that equal 0, where $r = \mathrm{rank}(\mathbf{C})$ (Exercise 19.10).
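The projection matrices of Chapter 19 are the leading examples of symmetric idempotent matrices. As an informal check, the sketch below builds $\mathbf{P}_X = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ and $\mathbf{M}_X = \mathbf{I}_n - \mathbf{P}_X$ from a randomly generated regressor matrix (the data and dimensions are assumptions made for the example) and confirms the stated eigenvalue, trace, and rank properties numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # n x (k+1)

P = X @ np.linalg.solve(X.T @ X, X.T)    # P_X, projects onto the columns of X
M = np.eye(n) - P                        # M_X, the annihilator matrix

is_idempotent = bool(np.allclose(P @ P, P) and np.allclose(M @ M, M))
product_is_zero = bool(np.allclose(P @ M, np.zeros((n, n))))

# Eigenvalues of a symmetric idempotent matrix are 0 or 1 (up to rounding).
eig_P = np.linalg.eigvalsh(P)
eigs_are_0_or_1 = bool(np.all(np.isclose(eig_P, 0.0, atol=1e-10)
                              | np.isclose(eig_P, 1.0, atol=1e-10)))

# trace = rank = k + 1 for P_X.
trace_P = float(np.trace(P))
rank_P = int(np.linalg.matrix_rank(P))

# Positive semidefiniteness: d'M d >= 0 for an arbitrary vector d.
d = rng.normal(size=n)
quad_form = float(d @ M @ d)
```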
Multivariate Distributions

This appendix collects various definitions and facts about distributions of vectors of random
variables. We start by defining the mean and covariance matrix of the $m$-dimensional random
variable $\mathbf{V}$. Next we present the multivariate normal distribution. Finally, we summarize some
facts about the distributions of linear and quadratic functions of jointly normally distributed
random variables.


The Mean Vector and Covariance Matrix

The first and second moments of an $m \times 1$ vector of random variables, $\mathbf{V} =
(V_1\ V_2\ \cdots\ V_m)'$, are summarized by its mean vector and covariance matrix.

Because $\mathbf{V}$ is a vector, the vector of its means (that is, its mean vector) is $E(\mathbf{V}) = \boldsymbol{\mu}_V$.
The $i$th element of the mean vector is the mean of the $i$th element of $\mathbf{V}$.

The covariance matrix of $\mathbf{V}$ is the matrix consisting of the variance $\mathrm{var}(V_i)$, $i = 1, \ldots, m$,
along the diagonal and the $(i, j)$ off-diagonal elements $\mathrm{cov}(V_i, V_j)$. In matrix form, the
covariance matrix $\boldsymbol{\Sigma}_V$ is

$$\boldsymbol{\Sigma}_V = E[(\mathbf{V} - \boldsymbol{\mu}_V)(\mathbf{V} - \boldsymbol{\mu}_V)'] = \begin{bmatrix} \mathrm{var}(V_1) & \cdots & \mathrm{cov}(V_1, V_m) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(V_m, V_1) & \cdots & \mathrm{var}(V_m) \end{bmatrix}. \quad (19.72)$$


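The Multivariate Normal Distribution

The $m \times 1$ vector random variable $\mathbf{V}$ has a multivariate normal distribution with mean vector
$\boldsymbol{\mu}_V$ and covariance matrix $\boldsymbol{\Sigma}_V$ if it has the joint probability density function

$$f(\mathbf{V}) = \frac{1}{\sqrt{(2\pi)^m \det(\boldsymbol{\Sigma}_V)}} \exp\left[-\frac{1}{2}(\mathbf{V} - \boldsymbol{\mu}_V)'\boldsymbol{\Sigma}_V^{-1}(\mathbf{V} - \boldsymbol{\mu}_V)\right], \quad (19.73)$$

where $\det(\boldsymbol{\Sigma}_V)$ is the determinant of the matrix $\boldsymbol{\Sigma}_V$. The multivariate normal distribution is
denoted $N(\boldsymbol{\mu}_V, \boldsymbol{\Sigma}_V)$.

An important fact about the multivariate normal distribution is that if two jointly normally
distributed random variables are uncorrelated (or, equivalently, have a block-diagonal
covariance matrix), then they are independently distributed. That is, let $\mathbf{V}_1$ and $\mathbf{V}_2$ be jointly
normally distributed random variables with respective dimensions $m_1 \times 1$ and $m_2 \times 1$. Then
if $\mathrm{cov}(\mathbf{V}_1, \mathbf{V}_2) = E[(\mathbf{V}_1 - \boldsymbol{\mu}_{V_1})(\mathbf{V}_2 - \boldsymbol{\mu}_{V_2})'] = \mathbf{0}_{m_1 \times m_2}$, $\mathbf{V}_1$ and $\mathbf{V}_2$ are independent.

If $\{V_i\}$ are i.i.d. $N(0, \sigma_v^2)$, then $\boldsymbol{\Sigma}_V = \sigma_v^2\mathbf{I}_m$, and the multivariate normal distribution
simplifies to the product of $m$ univariate normal densities.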
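The last claim can be verified directly from the density in Equation (19.73). As a short worked derivation, set $\boldsymbol{\mu}_V = \mathbf{0}_m$ and $\boldsymbol{\Sigma}_V = \sigma_v^2\mathbf{I}_m$, and use $\det(\sigma_v^2\mathbf{I}_m) = \sigma_v^{2m}$ and $(\sigma_v^2\mathbf{I}_m)^{-1} = \sigma_v^{-2}\mathbf{I}_m$:

```latex
\begin{aligned}
f(\mathbf{V})
 &= \frac{1}{\sqrt{(2\pi)^m \det(\sigma_v^2 \mathbf{I}_m)}}
    \exp\!\left[-\tfrac{1}{2}\,\mathbf{V}'(\sigma_v^2\mathbf{I}_m)^{-1}\mathbf{V}\right] \\
 &= \frac{1}{(2\pi\sigma_v^2)^{m/2}}
    \exp\!\left[-\frac{1}{2\sigma_v^2}\sum_{i=1}^m V_i^2\right] \\
 &= \prod_{i=1}^m \frac{1}{\sqrt{2\pi\sigma_v^2}}
    \exp\!\left(-\frac{V_i^2}{2\sigma_v^2}\right).
\end{aligned}
```

Each factor in the final line is the $N(0, \sigma_v^2)$ density evaluated at $V_i$, so the joint density is the product of the $m$ univariate densities.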


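Distributions of Linear Combinations and Quadratic
Forms of Normal Random Variables

Linear combinations of multivariate normal random variables are themselves normally
distributed, and certain quadratic forms of multivariate normal random variables have a chi-squared
distribution. Let $\mathbf{V}$ be an $m \times 1$ random variable distributed $N(\boldsymbol{\mu}_V, \boldsymbol{\Sigma}_V)$, let $\mathbf{A}$ and $\mathbf{B}$ be
nonrandom $a \times m$ and $b \times m$ matrices, and let $\mathbf{d}$ be a nonrandom $a \times 1$ vector. Then

$\mathbf{d} + \mathbf{A}\mathbf{V}$ is distributed $N(\mathbf{d} + \mathbf{A}\boldsymbol{\mu}_V, \mathbf{A}\boldsymbol{\Sigma}_V\mathbf{A}')$; (19.74)

$\mathrm{cov}(\mathbf{A}\mathbf{V}, \mathbf{B}\mathbf{V}) = \mathbf{A}\boldsymbol{\Sigma}_V\mathbf{B}'$; (19.75)

if $\mathbf{A}\boldsymbol{\Sigma}_V\mathbf{B}' = \mathbf{0}_{a \times b}$, then $\mathbf{A}\mathbf{V}$ and $\mathbf{B}\mathbf{V}$ are independently distributed; and (19.76)

$(\mathbf{V} - \boldsymbol{\mu}_V)'\boldsymbol{\Sigma}_V^{-1}(\mathbf{V} - \boldsymbol{\mu}_V)$ is distributed $\chi_m^2$. (19.77)

Let $\mathbf{U}$ be an $m$-dimensional multivariate standard normal random variable with distribution
$N(\mathbf{0}, \mathbf{I}_m)$. If $\mathbf{C}$ is symmetric and idempotent, then

$\mathbf{U}'\mathbf{C}\mathbf{U}$ has a $\chi_r^2$ distribution, where $r = \mathrm{rank}(\mathbf{C})$. (19.78)

Equation (19.78) is proven as Exercise 19.11.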
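Equation (19.78) can be illustrated by simulation. The sketch below is only a Monte Carlo check under assumed settings ($m = 6$, $r = 2$, 50,000 draws, and a projection matrix built from random columns); it compares the sample mean and variance of $\mathbf{U}'\mathbf{C}\mathbf{U}$ with the mean $r$ and variance $2r$ of a $\chi_r^2$ random variable.

```python
import numpy as np

rng = np.random.default_rng(3)
m, r = 6, 2

# A symmetric idempotent matrix of rank r: the projection onto r random columns.
X = rng.normal(size=(m, r))
C = X @ np.linalg.solve(X.T @ X, X.T)

# Each row of U is one draw from N(0, I_m).
U = rng.normal(size=(50_000, m))

# Quadratic form U'CU computed separately for every draw.
q = np.einsum("ij,jk,ik->i", U, C, U)

sample_mean = float(q.mean())   # should be close to r
sample_var = float(q.var())     # should be close to 2r
print(sample_mean, sample_var)
```

Agreement of the first two moments does not prove the distributional result, but a large discrepancy would signal an error in the construction of $\mathbf{C}$.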


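Derivation of the Asymptotic Distribution of $\hat{\boldsymbol{\beta}}$

This appendix provides the derivation of the asymptotic normal distribution of $\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$
given in Equation (19.12). An implication of this result is that $\hat{\boldsymbol{\beta}} \xrightarrow{p} \boldsymbol{\beta}$.

First consider the "denominator" matrix $\mathbf{X}'\mathbf{X}/n = \frac{1}{n}\sum_{i=1}^n \mathbf{X}_i\mathbf{X}_i'$ in Equation (19.15). The
$(j, l)$ element of this matrix is $\frac{1}{n}\sum_{i=1}^n X_{ji}X_{li}$. By the second assumption in Key Concept 19.1,
$\mathbf{X}_i$ is i.i.d., so $X_{ji}X_{li}$ is i.i.d. By the third assumption in Key Concept 19.1, each element of
$\mathbf{X}_i$ has four moments, so, by the Cauchy-Schwarz inequality (Appendix 18.2), $X_{ji}X_{li}$ has two
moments. Because $X_{ji}X_{li}$ is i.i.d. with two moments, $\frac{1}{n}\sum_{i=1}^n X_{ji}X_{li}$ obeys the law of large
numbers, so $\frac{1}{n}\sum_{i=1}^n X_{ji}X_{li} \xrightarrow{p} E(X_{ji}X_{li})$. This is true for all the elements of $\mathbf{X}'\mathbf{X}/n$, so
$\mathbf{X}'\mathbf{X}/n \xrightarrow{p} E(\mathbf{X}_i\mathbf{X}_i') = \mathbf{Q}_X$.

Next consider the "numerator" matrix in Equation (19.15), $\mathbf{X}'\mathbf{U}/\sqrt{n} = \sqrt{\tfrac{1}{n}}\sum_{i=1}^n \mathbf{V}_i$,
where $\mathbf{V}_i = \mathbf{X}_i u_i$. By the first assumption in Key Concept 19.1 and the law of iterated
expectations, $E(\mathbf{V}_i) = E[\mathbf{X}_i E(u_i \mid \mathbf{X}_i)] = \mathbf{0}_{k+1}$. By the second least squares assumption, $\mathbf{V}_i$ is i.i.d. Let
$\mathbf{c}$ be a finite $k + 1$ dimensional vector. By the Cauchy-Schwarz inequality,
$E[(\mathbf{c}'\mathbf{V}_i)^2] = E[(\mathbf{c}'\mathbf{X}_i u_i)^2] = E[(\mathbf{c}'\mathbf{X}_i)^2 u_i^2] \leq \sqrt{E[(\mathbf{c}'\mathbf{X}_i)^4]E(u_i^4)}$, which is finite by the
third least squares assumption. This is true for every such vector $\mathbf{c}$, so $E(\mathbf{V}_i\mathbf{V}_i') = \boldsymbol{\Sigma}_V$ is finite
and, we assume, positive definite. Thus the multivariate central limit theorem of Key Concept
19.2 applies to $\sqrt{\tfrac{1}{n}}\sum_{i=1}^n \mathbf{V}_i = \frac{1}{\sqrt{n}}\mathbf{X}'\mathbf{U}$; that is,

$$\frac{1}{\sqrt{n}}\mathbf{X}'\mathbf{U} \xrightarrow{d} N(\mathbf{0}_{k+1}, \boldsymbol{\Sigma}_V). \quad (19.79)$$

The result in Equation (19.12) follows from Equations (19.15) and (19.79), the consistency of $\mathbf{X}'\mathbf{X}/n$,
the fourth least squares assumption (which ensures that $(\mathbf{X}'\mathbf{X})^{-1}$ exists), and Slutsky's theorem.


Derivations of Exact Distributions of OLS
Test Statistics with Normal Errors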


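This appendix presents the proofs of the distributions under the null hypothesis of the
homoskedasticity-only t-statistic in Equation (19.35) and the homoskedasticity-only F-statistic
in Equation (19.37), assuming that all six assumptions in Key Concept 19.1 hold.


Proof of Equation (19.35)

If (i) $Z$ has a standard normal distribution, (ii) $W$ has a $\chi_m^2$ distribution, and (iii) $Z$ and $W$ are
independently distributed, then the random variable $Z/\sqrt{W/m}$ has the t distribution with $m$
degrees of freedom (Appendix 18.1). To put $\tilde{t}$ in this form, notice that $\tilde{\boldsymbol{\Sigma}}_{\hat{\beta}} = (s_{\hat{u}}^2/\sigma_u^2)\boldsymbol{\Sigma}_{\hat{\beta}|X}$.
Then rewrite Equation (19.34) as

$$\tilde{t} = \frac{(\hat{\beta}_j - \beta_{j,0})/\sqrt{(\boldsymbol{\Sigma}_{\hat{\beta}|X})_{jj}}}{\sqrt{W/(n - k - 1)}}, \quad (19.80)$$

where $W = (n - k - 1)(s_{\hat{u}}^2/\sigma_u^2)$, and let $Z = (\hat{\beta}_j - \beta_{j,0})/\sqrt{(\boldsymbol{\Sigma}_{\hat{\beta}|X})_{jj}}$ and $m = n - k - 1$.
With these definitions, $\tilde{t} = Z/\sqrt{W/m}$. Thus, to prove the result in Equation (19.35), we
must show (i) through (iii) for these definitions of $Z$, $W$, and $m$.

i. An implication of Equation (19.30) is that, under the null hypothesis, $Z =
(\hat{\beta}_j - \beta_{j,0})/\sqrt{(\boldsymbol{\Sigma}_{\hat{\beta}|X})_{jj}}$ has an exact standard normal distribution, which shows (i).

ii. From Equation (19.31), $W$ is distributed as $\chi_{n-k-1}^2$, which shows (ii).

iii. To show (iii), it must be shown that $\hat{\beta}_j$ and $s_{\hat{u}}^2$ are independently distributed.

From Equations (19.14) and (19.29), $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{U}$ and $s_{\hat{u}}^2 = (\mathbf{M}_X\mathbf{U})'(\mathbf{M}_X\mathbf{U})/
(n - k - 1)$. Thus $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}$ and $s_{\hat{u}}^2$ are independent if $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{U}$ and $\mathbf{M}_X\mathbf{U}$ are independent.
Both $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{U}$ and $\mathbf{M}_X\mathbf{U}$ are linear combinations of $\mathbf{U}$, which has an $N(\mathbf{0}_{n \times 1}, \sigma_u^2\mathbf{I}_n)$
distribution, conditional on $\mathbf{X}$. But because $\mathbf{M}_X\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \mathbf{0}_{n \times (k+1)}$ [Equation (19.26)], it follows that
$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{U}$ and $\mathbf{M}_X\mathbf{U}$ are independently distributed [Equation (19.76)]. Consequently, under
all six assumptions in Key Concept 19.1,

$\hat{\boldsymbol{\beta}}$ and $s_{\hat{u}}^2$ are independently distributed, (19.81)

which shows (iii) and thus proves Equation (19.35).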


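Proof of Equation (19.37)

The $F_{n_1,n_2}$ distribution is the distribution of $(W_1/n_1)/(W_2/n_2)$, where (i) $W_1$ is distributed
$\chi_{n_1}^2$, (ii) $W_2$ is distributed $\chi_{n_2}^2$, and (iii) $W_1$ and $W_2$ are independently distributed (Appendix
18.1). To express $\tilde{F}$ in this form, let $W_1 = (\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\sigma_u^2]^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})$ and
$W_2 = (n - k - 1)s_{\hat{u}}^2/\sigma_u^2$. Substitution of these definitions into Equation (19.36) shows that
$\tilde{F} = (W_1/q)/[W_2/(n - k - 1)]$. Thus, by the definition of the F distribution, $\tilde{F}$ has an
$F_{q,n-k-1}$ distribution if (i) through (iii) hold with $n_1 = q$ and $n_2 = n - k - 1$.

i. Under the null hypothesis, $\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r} = \mathbf{R}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$. Because $\hat{\boldsymbol{\beta}}$ has the conditional normal
distribution in Equation (19.30) and because $\mathbf{R}$ is a nonrandom matrix, $\mathbf{R}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ is
distributed $N(\mathbf{0}_{q \times 1}, \mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\sigma_u^2)$, conditional on $\mathbf{X}$. Thus, by Equation (19.77) in
Appendix 19.2, $(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\sigma_u^2]^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})$ is distributed $\chi_q^2$, proving (i).

ii. Requirement (ii) is shown in Equation (19.31).

iii. It has already been shown that $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}$ and $s_{\hat{u}}^2$ are independently distributed
[Equation (19.81)]. It follows that $\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r}$ and $s_{\hat{u}}^2$ are independently distributed, which in
turn implies that $W_1$ and $W_2$ are independently distributed, proving (iii) and completing
the proof.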


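Proof of the Gauss-Markov Theorem
for Multiple Regression

This appendix proves the Gauss-Markov theorem (Key Concept 19.3) for the multiple regression
model. Let $\tilde{\boldsymbol{\beta}}$ be a linear conditionally unbiased estimator of $\boldsymbol{\beta}$ so that $\tilde{\boldsymbol{\beta}} = \mathbf{A}'\mathbf{Y}$ and
$E(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) = \boldsymbol{\beta}$, where $\mathbf{A}$ is an $n \times (k + 1)$ matrix that can depend on $\mathbf{X}$ and nonrandom
constants. We show that $\mathrm{var}(\mathbf{c}'\hat{\boldsymbol{\beta}}) \leq \mathrm{var}(\mathbf{c}'\tilde{\boldsymbol{\beta}})$ for all $k + 1$ dimensional vectors $\mathbf{c}$, where the
inequality holds with equality only if $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}$.

Because $\tilde{\boldsymbol{\beta}}$ is linear, it can be written as $\tilde{\boldsymbol{\beta}} = \mathbf{A}'\mathbf{Y} = \mathbf{A}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{U}) = (\mathbf{A}'\mathbf{X})\boldsymbol{\beta} + \mathbf{A}'\mathbf{U}$.
By the first Gauss-Markov condition, $E(\mathbf{U} \mid \mathbf{X}) = \mathbf{0}_{n \times 1}$, so $E(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) = (\mathbf{A}'\mathbf{X})\boldsymbol{\beta}$, but because
$\tilde{\boldsymbol{\beta}}$ is conditionally unbiased, $E(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) = \boldsymbol{\beta} = (\mathbf{A}'\mathbf{X})\boldsymbol{\beta}$, which implies that $\mathbf{A}'\mathbf{X} = \mathbf{I}_{k+1}$. Thus
$\tilde{\boldsymbol{\beta}} = \boldsymbol{\beta} + \mathbf{A}'\mathbf{U}$, so $\mathrm{var}(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) = \mathrm{var}(\mathbf{A}'\mathbf{U} \mid \mathbf{X}) = E(\mathbf{A}'\mathbf{U}\mathbf{U}'\mathbf{A} \mid \mathbf{X}) = \mathbf{A}'E(\mathbf{U}\mathbf{U}' \mid \mathbf{X})\mathbf{A} = \sigma_u^2\mathbf{A}'\mathbf{A}$,
where the third equality follows because $\mathbf{A}$ can depend on $\mathbf{X}$ but not $\mathbf{U}$ and the final equality
follows from the second Gauss-Markov condition. That is, if $\tilde{\boldsymbol{\beta}}$ is linear and unbiased, then
under the Gauss-Markov conditions,

$\mathbf{A}'\mathbf{X} = \mathbf{I}_{k+1}$ and $\mathrm{var}(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma_u^2\mathbf{A}'\mathbf{A}$. (19.82)

The results in Equation (19.82) also apply to $\hat{\boldsymbol{\beta}}$ with $\mathbf{A} = \hat{\mathbf{A}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$, where $(\mathbf{X}'\mathbf{X})^{-1}$ exists
by the third Gauss-Markov condition.

Now let $\mathbf{A} = \hat{\mathbf{A}} + \mathbf{D}$, so that $\mathbf{D}$ is the difference between the matrices $\mathbf{A}$ and $\hat{\mathbf{A}}$.
Note that $\hat{\mathbf{A}}'\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}$ [by Equation (19.82)] and $\hat{\mathbf{A}}'\hat{\mathbf{A}} =
(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = (\mathbf{X}'\mathbf{X})^{-1}$, so $\hat{\mathbf{A}}'\mathbf{D} = \hat{\mathbf{A}}'(\mathbf{A} - \hat{\mathbf{A}}) = \hat{\mathbf{A}}'\mathbf{A} - \hat{\mathbf{A}}'\hat{\mathbf{A}} = \mathbf{0}_{(k+1) \times (k+1)}$.
Substituting $\mathbf{A} = \hat{\mathbf{A}} + \mathbf{D}$ into the formula for the conditional variance in Equation (19.82)
yields

$$\begin{aligned} \mathrm{var}(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) &= \sigma_u^2(\hat{\mathbf{A}} + \mathbf{D})'(\hat{\mathbf{A}} + \mathbf{D}) \\ &= \sigma_u^2[\hat{\mathbf{A}}'\hat{\mathbf{A}} + \hat{\mathbf{A}}'\mathbf{D} + \mathbf{D}'\hat{\mathbf{A}} + \mathbf{D}'\mathbf{D}] \\ &= \sigma_u^2(\mathbf{X}'\mathbf{X})^{-1} + \sigma_u^2\mathbf{D}'\mathbf{D}, \end{aligned} \quad (19.83)$$

where the final equality uses the facts $\hat{\mathbf{A}}'\hat{\mathbf{A}} = (\mathbf{X}'\mathbf{X})^{-1}$ and $\hat{\mathbf{A}}'\mathbf{D} = \mathbf{0}_{(k+1) \times (k+1)}$.

Because $\mathrm{var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma_u^2(\mathbf{X}'\mathbf{X})^{-1}$, Equations (19.82) and (19.83) imply that
$\mathrm{var}(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) - \mathrm{var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma_u^2\mathbf{D}'\mathbf{D}$. The difference between the variances of the two estimators
of the linear combination $\mathbf{c}'\boldsymbol{\beta}$ thus is

$$\mathrm{var}(\mathbf{c}'\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) - \mathrm{var}(\mathbf{c}'\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma_u^2\mathbf{c}'\mathbf{D}'\mathbf{D}\mathbf{c} \geq 0. \quad (19.84)$$

The inequality in Equation (19.84) holds for all linear combinations $\mathbf{c}'\boldsymbol{\beta}$, and the inequality
holds with equality for all nonzero $\mathbf{c}$ only if $\mathbf{D} = \mathbf{0}_{n \times (k+1)}$, that is, if $\mathbf{A} = \hat{\mathbf{A}}$ or, equivalently,
$\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}$. Thus $\mathbf{c}'\hat{\boldsymbol{\beta}}$ has the smallest variance of all linear conditionally unbiased estimators of
$\mathbf{c}'\boldsymbol{\beta}$; that is, the OLS estimator is BLUE.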
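The algebra of the proof can be traced numerically. The sketch below is an illustration under assumed settings (random regressors, $n = 30$, $k = 2$): it constructs an alternative weight matrix $\mathbf{A} = \hat{\mathbf{A}} + \mathbf{D}$ with $\mathbf{D}'\mathbf{X} = \mathbf{0}$, so that the alternative estimator is still conditionally unbiased, and checks that the variance gap divided by $\sigma_u^2$ equals $\mathbf{D}'\mathbf{D}$, which is positive semidefinite. Building $\mathbf{D}$ as $\mathbf{M}_X\mathbf{G}$ for a random matrix `G` is a device chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])    # n x (k+1)

A_hat = X @ np.linalg.inv(X.T @ X)                   # OLS weights: beta_hat = A_hat'Y
M_X = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)  # annihilator matrix
G = rng.normal(size=(n, k + 1))                      # arbitrary matrix (illustrative)
D = M_X @ G                                          # guarantees D'X = 0
A = A_hat + D                                        # weights of another linear estimator

# Conditional unbiasedness requires A'X = I_{k+1}.
unbiased = bool(np.allclose(A.T @ X, np.eye(k + 1)))

# Variance gap divided by sigma_u^2: A'A - A_hat'A_hat should equal D'D.
gap = A.T @ A - A_hat.T @ A_hat
gap_is_DtD = bool(np.allclose(gap, D.T @ D))

# D'D is positive semidefinite, so no linear combination has smaller variance than OLS.
min_eig = float(np.linalg.eigvalsh(D.T @ D).min())
```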


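Proof of Selected Results for IV
and GMM Estimation


The Efficiency of TSLS Under Homoskedasticity
[Proof of Equation (19.62)]

When the errors $u_i$ are homoskedastic, the difference between $\boldsymbol{\Sigma}_A^{IV}$ [Equation (19.61)] and
$\boldsymbol{\Sigma}^{TSLS}$ [Equation (19.55)] is given by

$$\begin{aligned} \boldsymbol{\Sigma}_A^{IV} - \boldsymbol{\Sigma}^{TSLS} &= (\mathbf{Q}_{XZ}\mathbf{A}\mathbf{Q}_{ZX})^{-1}\mathbf{Q}_{XZ}\mathbf{A}\mathbf{Q}_{ZZ}\mathbf{A}\mathbf{Q}_{ZX}(\mathbf{Q}_{XZ}\mathbf{A}\mathbf{Q}_{ZX})^{-1}\sigma_u^2 - (\mathbf{Q}_{XZ}\mathbf{Q}_{ZZ}^{-1}\mathbf{Q}_{ZX})^{-1}\sigma_u^2 \\ &= (\mathbf{Q}_{XZ}\mathbf{A}\mathbf{Q}_{ZX})^{-1}\mathbf{Q}_{XZ}\mathbf{A}[\mathbf{Q}_{ZZ} - \mathbf{Q}_{ZX}(\mathbf{Q}_{XZ}\mathbf{Q}_{ZZ}^{-1}\mathbf{Q}_{ZX})^{-1}\mathbf{Q}_{XZ}]\mathbf{A}\mathbf{Q}_{ZX}(\mathbf{Q}_{XZ}\mathbf{A}\mathbf{Q}_{ZX})^{-1}\sigma_u^2, \end{aligned} \quad (19.85)$$

where the second term within the brackets in the second equality follows from
$(\mathbf{Q}_{XZ}\mathbf{A}\mathbf{Q}_{ZX})^{-1}\mathbf{Q}_{XZ}\mathbf{A}\mathbf{Q}_{ZX} = \mathbf{I}_{(k+r+1)}$. Let $\mathbf{F}$ be the matrix square root of $\mathbf{Q}_{ZZ}$, so $\mathbf{Q}_{ZZ} = \mathbf{F}'\mathbf{F}$
and $\mathbf{Q}_{ZZ}^{-1} = \mathbf{F}^{-1}\mathbf{F}^{-1\prime}$. [The latter equality follows from noting that $(\mathbf{F}'\mathbf{F})^{-1} = \mathbf{F}^{-1}\mathbf{F}'^{-1}$ and
$\mathbf{F}'^{-1} = \mathbf{F}^{-1\prime}$.] Then the final expression in Equation (19.85) can be rewritten to yield

$$\boldsymbol{\Sigma}_A^{IV} - \boldsymbol{\Sigma}^{TSLS} = (\mathbf{Q}_{XZ}\mathbf{A}\mathbf{Q}_{ZX})^{-1}\mathbf{Q}_{XZ}\mathbf{A}\mathbf{F}'[\mathbf{I} - \mathbf{F}^{-1\prime}\mathbf{Q}_{ZX}(\mathbf{Q}_{XZ}\mathbf{F}^{-1}\mathbf{F}^{-1\prime}\mathbf{Q}_{ZX})^{-1}\mathbf{Q}_{XZ}\mathbf{F}^{-1}]\mathbf{F}\mathbf{A}\mathbf{Q}_{ZX}(\mathbf{Q}_{XZ}\mathbf{A}\mathbf{Q}_{ZX})^{-1}\sigma_u^2, \quad (19.86)$$

where the second expression within the brackets uses $\mathbf{F}'\mathbf{F}^{-1\prime} = \mathbf{I}$. Thus

$$\mathbf{c}'(\boldsymbol{\Sigma}_A^{IV} - \boldsymbol{\Sigma}^{TSLS})\mathbf{c} = \mathbf{d}'[\mathbf{I} - \mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}']\mathbf{d}\,\sigma_u^2, \quad (19.87)$$

where $\mathbf{d} = \mathbf{F}\mathbf{A}\mathbf{Q}_{ZX}(\mathbf{Q}_{XZ}\mathbf{A}\mathbf{Q}_{ZX})^{-1}\mathbf{c}$ and $\mathbf{D} = \mathbf{F}^{-1\prime}\mathbf{Q}_{ZX}$. Now $\mathbf{I} - \mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}'$ is a symmetric
idempotent matrix (Exercise 19.5). As a result, $\mathbf{I} - \mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}'$ has eigenvalues that are
either 0 or 1, and $\mathbf{d}'[\mathbf{I} - \mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}']\mathbf{d} \geq 0$ (Exercise 19.10). Thus $\mathbf{c}'(\boldsymbol{\Sigma}_A^{IV} - \boldsymbol{\Sigma}^{TSLS})\mathbf{c} \geq 0$,
proving that TSLS is efficient under homoskedasticity.


Asymptotic Distribution of the J-Statistic Under
Homoskedasticity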


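The J-statistic is defined in Equation (19.63). First note that

$$\begin{aligned} \hat{\mathbf{U}} &= \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}^{TSLS} \\ &= \mathbf{Y} - \mathbf{X}(\mathbf{X}'\mathbf{P}_Z\mathbf{X})^{-1}\mathbf{X}'\mathbf{P}_Z\mathbf{Y} \\ &= (\mathbf{X}\boldsymbol{\beta} + \mathbf{U}) - \mathbf{X}(\mathbf{X}'\mathbf{P}_Z\mathbf{X})^{-1}\mathbf{X}'\mathbf{P}_Z(\mathbf{X}\boldsymbol{\beta} + \mathbf{U}) \\ &= \mathbf{U} - \mathbf{X}(\mathbf{X}'\mathbf{P}_Z\mathbf{X})^{-1}\mathbf{X}'\mathbf{P}_Z\mathbf{U} \\ &= [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{P}_Z\mathbf{X})^{-1}\mathbf{X}'\mathbf{P}_Z]\mathbf{U}. \end{aligned} \quad (19.88)$$

Thus

$$\begin{aligned} \hat{\mathbf{U}}'\mathbf{P}_Z\hat{\mathbf{U}} &= \mathbf{U}'[\mathbf{I} - \mathbf{P}_Z\mathbf{X}(\mathbf{X}'\mathbf{P}_Z\mathbf{X})^{-1}\mathbf{X}']\mathbf{P}_Z[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{P}_Z\mathbf{X})^{-1}\mathbf{X}'\mathbf{P}_Z]\mathbf{U} \\ &= \mathbf{U}'[\mathbf{P}_Z - \mathbf{P}_Z\mathbf{X}(\mathbf{X}'\mathbf{P}_Z\mathbf{X})^{-1}\mathbf{X}'\mathbf{P}_Z]\mathbf{U}, \end{aligned} \quad (19.89)$$

where the second equality follows by simplifying the preceding expression. Because $\mathbf{Z}'\mathbf{Z}$ is
symmetric and positive definite, it can be written in terms of its matrix square root,
$\mathbf{Z}'\mathbf{Z} = (\mathbf{Z}'\mathbf{Z})^{1/2\prime}(\mathbf{Z}'\mathbf{Z})^{1/2}$, and this matrix square root is invertible, so $(\mathbf{Z}'\mathbf{Z})^{-1} =
(\mathbf{Z}'\mathbf{Z})^{-1/2}(\mathbf{Z}'\mathbf{Z})^{-1/2\prime}$, where $(\mathbf{Z}'\mathbf{Z})^{-1/2} = [(\mathbf{Z}'\mathbf{Z})^{1/2}]^{-1}$. Thus $\mathbf{P}_Z$ can be written as $\mathbf{P}_Z =
\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}' = \mathbf{B}\mathbf{B}'$, where $\mathbf{B} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1/2}$. Substituting this expression for $\mathbf{P}_Z$ into the final
expression in Equation (19.89) yields

$$\begin{aligned} \hat{\mathbf{U}}'\mathbf{P}_Z\hat{\mathbf{U}} &= \mathbf{U}'[\mathbf{B}\mathbf{B}' - \mathbf{B}\mathbf{B}'\mathbf{X}(\mathbf{X}'\mathbf{B}\mathbf{B}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{B}\mathbf{B}']\mathbf{U} \\ &= \mathbf{U}'\mathbf{B}[\mathbf{I} - \mathbf{B}'\mathbf{X}(\mathbf{X}'\mathbf{B}\mathbf{B}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{B}]\mathbf{B}'\mathbf{U} \\ &= \mathbf{U}'\mathbf{B}\mathbf{M}_{B'X}\mathbf{B}'\mathbf{U}, \end{aligned} \quad (19.90)$$

where $\mathbf{M}_{B'X} = \mathbf{I} - \mathbf{B}'\mathbf{X}(\mathbf{X}'\mathbf{B}\mathbf{B}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{B}$ is a symmetric idempotent matrix.

The asymptotic null distribution of $\hat{\mathbf{U}}'\mathbf{P}_Z\hat{\mathbf{U}}$ is found by computing the limits in probability
and in distribution of the various terms in the final expression in Equation (19.90) under the
null hypothesis. Under the null hypothesis that $E(\mathbf{Z}_i u_i) = \mathbf{0}$, $\mathbf{Z}'\mathbf{U}/\sqrt{n}$ has mean 0, and
the central limit theorem applies, so $\mathbf{Z}'\mathbf{U}/\sqrt{n} \xrightarrow{d} N(\mathbf{0}, \mathbf{Q}_{ZZ}\sigma_u^2)$. In addition,
$\mathbf{Z}'\mathbf{Z}/n \xrightarrow{p} \mathbf{Q}_{ZZ}$ and $\mathbf{X}'\mathbf{Z}/n \xrightarrow{p} \mathbf{Q}_{XZ}$. Thus $\mathbf{B}'\mathbf{U} = (\mathbf{Z}'\mathbf{Z})^{-1/2\prime}\mathbf{Z}'\mathbf{U} = (\mathbf{Z}'\mathbf{Z}/n)^{-1/2\prime}
(\mathbf{Z}'\mathbf{U}/\sqrt{n}) \xrightarrow{d} \sigma_u\mathbf{z}$, where $\mathbf{z}$ is distributed $N(\mathbf{0}_{m+r+1}, \mathbf{I}_{m+r+1})$. In addition, $\mathbf{B}'\mathbf{X}/\sqrt{n} =
(\mathbf{Z}'\mathbf{Z}/n)^{-1/2\prime}(\mathbf{Z}'\mathbf{X}/n) \xrightarrow{p} \mathbf{Q}_{ZZ}^{-1/2\prime}\mathbf{Q}_{ZX}$, so $\mathbf{M}_{B'X} \xrightarrow{p} \mathbf{I} - \mathbf{Q}_{ZZ}^{-1/2\prime}\mathbf{Q}_{ZX}(\mathbf{Q}_{XZ}\mathbf{Q}_{ZZ}^{-1/2}\mathbf{Q}_{ZZ}^{-1/2\prime}\mathbf{Q}_{ZX})^{-1}
\mathbf{Q}_{XZ}\mathbf{Q}_{ZZ}^{-1/2} = \mathbf{M}_{\mathbf{Q}_{ZZ}^{-1/2\prime}\mathbf{Q}_{ZX}}$. Thus

$$\hat{\mathbf{U}}'\mathbf{P}_Z\hat{\mathbf{U}} \xrightarrow{d} (\mathbf{z}'\mathbf{M}_{\mathbf{Q}_{ZZ}^{-1/2\prime}\mathbf{Q}_{ZX}}\mathbf{z})\sigma_u^2. \quad (19.91)$$

Under the null hypothesis, the TSLS estimator is consistent, and the coefficients in the
regression of $\hat{\mathbf{U}}$ on $\mathbf{Z}$ converge in probability to 0 [an implication of Equation (19.91)], so the
denominator in the definition of the J-statistic is a consistent estimator of $\sigma_u^2$:

$$\hat{\mathbf{U}}'\mathbf{M}_Z\hat{\mathbf{U}}/(n - m - r - 1) \xrightarrow{p} \sigma_u^2. \quad (19.92)$$

From the definition of the J-statistic and Equations (19.91) and (19.92), it follows that

$$J = \frac{\hat{\mathbf{U}}'\mathbf{P}_Z\hat{\mathbf{U}}}{\hat{\mathbf{U}}'\mathbf{M}_Z\hat{\mathbf{U}}/(n - m - r - 1)} \xrightarrow{d} \mathbf{z}'\mathbf{M}_{\mathbf{Q}_{ZZ}^{-1/2\prime}\mathbf{Q}_{ZX}}\mathbf{z}. \quad (19.93)$$

Because $\mathbf{z}$ is a standard normal random vector and $\mathbf{M}_{\mathbf{Q}_{ZZ}^{-1/2\prime}\mathbf{Q}_{ZX}}$ is a symmetric idempotent
matrix, $J$ is distributed as a chi-squared random variable with degrees of freedom that equals
the rank of $\mathbf{M}_{\mathbf{Q}_{ZZ}^{-1/2\prime}\mathbf{Q}_{ZX}}$ [Equation (19.78)]. Because $\mathbf{Q}_{ZZ}^{-1/2\prime}\mathbf{Q}_{ZX}$ is $(m + r + 1) \times (k + r + 1)$
and $m > k$, the rank of $\mathbf{M}_{\mathbf{Q}_{ZZ}^{-1/2\prime}\mathbf{Q}_{ZX}}$ is $m - k$ [Exercise 19.5]. Thus $J \xrightarrow{d} \chi_{m-k}^2$, which is the result
stated in Equation (19.64).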


The Efficiency of the Efficient GMM Estimator 
The infeasible efficient GMM estimator, $\tilde\beta^{Eff.GMM}$, is defined in Equation (19.66). The proof that $\tilde\beta^{Eff.GMM}$ is efficient entails showing that $c'(\Sigma_A^{GMM} - \Sigma^{Eff.GMM})c \ge 0$ for all vectors $c$. The proof closely parallels the proof of the efficiency of the TSLS estimator in the first section of this appendix, with the sole modification that $H^{-1}$ replaces $(Q_{ZZ}\sigma_u^2)^{-1}$ in Equation (19.85) and subsequently.


Distribution of the GMM J-Statistic 


The GMM J-statistic is given in Equation (19.70). The proof that, under the null hypothesis, $J^{GMM} \xrightarrow{d} \chi^2_{m-k}$ closely parallels the corresponding proof for the TSLS J-statistic under homoskedasticity.


Regression with Many Predictors: 
MSPE, Ridge Regression, and Principal 
Components Analysis 


This appendix presents the derivations for various results used in Chapter 14 that rely on matrix calculations.


The MSPE for Linear Regression Estimated by OLS 
We first derive Equation (14.4), the mean squared prediction error (MSPE) of the OLS estimator under homoskedasticity.

Let the $k \times 1$ vector $X^{oos}$ denote the values of the X's for the out-of-sample observation ("oos") to be predicted. With this notation, the MSPE in Equation (14.3), written using matrix notation, is

$$MSPE = \sigma_u^2 + E[(\hat\beta - \beta)'X^{oos}]^2, \qquad (19.94)$$

where $\hat\beta$ denotes any estimator of $\beta$, not just the OLS estimator.

Under the least squares assumptions for prediction, the out-of-sample observation is assumed to be an i.i.d. draw from the same population as the estimation sample. Under this assumption, the MSPE in Equation (19.94) can be written

$$MSPE = \sigma_u^2 + \mathrm{trace}\{E[(\hat\beta - \beta)(\hat\beta - \beta)']Q_X\}, \qquad (19.95)$$

where $Q_X = E(XX')$. Equation (19.95) follows from Equation (19.94) by writing $E[(\hat\beta - \beta)'X^{oos}]^2 = E[X^{oos\prime}(\hat\beta - \beta)(\hat\beta - \beta)'X^{oos}] = \mathrm{trace}\,E[(\hat\beta - \beta)(\hat\beta - \beta)'X^{oos}X^{oos\prime}] = \mathrm{trace}\{E[(\hat\beta - \beta)(\hat\beta - \beta)']Q_X\}$, where the second equality uses the property of the trace that $a'Ba = \mathrm{trace}(Baa')$ for an $n \times n$ matrix $B$ and an $n \times 1$ vector $a$ and where the final equality uses the assumptions that the out-of-sample observation is independent of the estimation observations and that it is drawn from the same distribution, so that $E(X^{oos}X^{oos\prime}) = Q_X$.

The MSPE for OLS obtains by substituting the expression for OLS in Equation (19.14) into Equation (19.95) and simplifying. First note that, under the assumption of homoskedasticity, for the OLS estimator,

$$
\begin{aligned}
E[(\hat\beta - \beta)(\hat\beta - \beta)'] &= E[(X'X)^{-1}X'uu'X(X'X)^{-1}] \\
&= E[(X'X)^{-1}X'E(uu'|X)X(X'X)^{-1}] \\
&= E[(X'X)^{-1}X'X(X'X)^{-1}]\sigma_u^2 = E[(X'X)^{-1}]\sigma_u^2,
\end{aligned}
$$

where the first equality uses Equation (19.14); the second equality uses the law of iterated expectations; the third equality uses the assumption of homoskedasticity, so $E(uu'|X) = \sigma_u^2I_n$; and the final equality simplifies. Substitution of $E[(\hat\beta - \beta)(\hat\beta - \beta)'] = E[(X'X)^{-1}]\sigma_u^2$ into Equation (19.95) and multiplying and dividing the second term by $1/n$ yields

$$MSPE_{OLS} = \sigma_u^2 + \frac{\sigma_u^2}{n}\,\mathrm{trace}\left\{E\left[\left(\frac{X'X}{n}\right)^{-1}\right]Q_X\right\}. \qquad (19.96)$$

Equation (19.96) is the MSPE for a prediction made using the OLS estimator under the least squares assumptions for prediction with homoskedastic errors.

Equation (14.4) is an approximation to Equation (19.96) when $n$ is large relative to $k$. In that case, $X'X/n \cong Q_X$ (specifically, for fixed $k$, $X'X/n \xrightarrow{p} Q_X$), so $\mathrm{trace}\{E[(X'X/n)^{-1}]Q_X\} \cong \mathrm{trace}\{Q_X^{-1}Q_X\} = \mathrm{trace}\{I_k\} = k$. Substitution of this final expression into Equation (19.96) and collecting terms yields Equation (14.4):

$$MSPE_{OLS} \cong \left(1 + \frac{k}{n}\right)\sigma_u^2. \qquad (19.97)$$
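To see how close the approximation in Equation (19.97) is to Equation (19.96), the expectation in Equation (19.96) can be approximated by simulation. The sketch below is illustrative only: it assumes standard normal regressors, so that $Q_X = I_k$, and the function names, sample sizes, and number of draws are hypothetical choices rather than anything taken from the text.

```python
import numpy as np

def mspe_ols_simulated(n, k, sigma2_u, n_draws=2000, seed=0):
    """Approximate Equation (19.96) by averaging over simulated X's.

    Assumes each row of X is N(0, I_k), so that Q_X = I_k.
    """
    rng = np.random.default_rng(seed)
    Q_X = np.eye(k)
    total = 0.0
    for _ in range(n_draws):
        X = rng.normal(size=(n, k))
        total += np.trace(np.linalg.inv(X.T @ X / n) @ Q_X)
    mean_trace = total / n_draws          # estimates trace{E[(X'X/n)^{-1}] Q_X}
    return sigma2_u + sigma2_u / n * mean_trace

def mspe_ols_approx(n, k, sigma2_u):
    """Large-n approximation of Equation (19.97)."""
    return (1 + k / n) * sigma2_u

mspe_sim = mspe_ols_simulated(n=50, k=5, sigma2_u=1.0)
mspe_approx = mspe_ols_approx(n=50, k=5, sigma2_u=1.0)
print(mspe_sim, mspe_approx)
```

With $k/n = 0.1$ the two values are close but not identical, and the simulated value is slightly larger, which is consistent with Equation (19.97) being an approximation that improves as $n$ grows relative to $k$.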


Connection to the final prediction error (FPE). Equation (19.97) is used in the derivation of the final prediction error (FPE) for time series forecasting given in Equation (15.21) (with a change in notation so that $n$ is replaced by $T$ and $k$ is replaced by $p + 1$). The key difference between the cross-section and time-series cases is the relation of the out-of-sample observation to the in-sample observations. In the derivation here, the in- and out-of-sample observations are independent. If the values of the predictors in the time series application are independent of the data used to estimate the coefficients, then the derivation here applies directly. Typically this will not be the case, however, because the final observations in the sample (the ones used to make the out-of-sample forecast) are correlated with the in-sample observations. If the sample size is large, however, then the dependence between the estimated regression coefficients and the out-of-sample predictors is small, so Equation (19.97) still holds as an approximation when the sample size is large relative to the number of regressors.

Equation (14.8) provides an expression for the ridge regression estimator with a single regressor. This appendix derives an expression for the case of multiple regressors.
The ridge regression estimator minimizes the penalized sum of squared residuals in Equation (14.7), written here using matrix notation:

$$S^{Ridge}(b; \lambda_{Ridge}) = (Y - Xb)'(Y - Xb) + \lambda_{Ridge}b'b. \qquad (19.98)$$

Taking the derivative of the right-hand side of Equation (19.98) and setting it to 0 yields the system solved by the ridge regression estimator $\hat\beta^{Ridge}$, $-2X'(Y - X\hat\beta^{Ridge}) + 2\lambda_{Ridge}\hat\beta^{Ridge} = 0$ [cf. Equations (19.9) and (19.10) for OLS]. Solving this system yields the formula for the ridge regression estimator,

$$\hat\beta^{Ridge} = (X'X + \lambda_{Ridge}I_k)^{-1}X'Y. \qquad (19.99)$$
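Equation (19.99) translates directly into a few lines of matrix code. The following is a minimal sketch on simulated data (the function name `ridge_estimator`, the data, and the penalty value are assumptions made for illustration); it also checks that the estimator satisfies the first-order condition stated above.

```python
import numpy as np

def ridge_estimator(X, Y, lam):
    """Ridge estimator of Equation (19.99): (X'X + lam * I_k)^{-1} X'Y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)

# Illustrative data (not from the text).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=40)

lam = 5.0
beta_ridge = ridge_estimator(X, Y, lam)
beta_ols = ridge_estimator(X, Y, 0.0)      # lam = 0 gives the OLS estimator

# First-order condition: -2X'(Y - X b) + 2 * lam * b = 0 at b = beta_ridge.
foc = -2 * X.T @ (Y - X @ beta_ridge) + 2 * lam * beta_ridge
print(beta_ols, beta_ridge, np.abs(foc).max())
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience and does not change the estimator.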


We note two implications of this formula that are discussed in Sections 14.3 and 14.4, 
respectively. 

First, if the regressors are uncorrelated in the estimation sample, the ridge regression estimator can be written as the OLS estimator, shrunk toward 0 by a factor that depends on the data, that is, $\hat\beta_j^{Ridge} = (1 + \lambda_{Ridge}/\sum_{i=1}^n X_{ji}^2)^{-1}\hat\beta_j$, which is Equation (14.8). Moreover, if in addition the regressors are standardized using the sample standard deviation, as they are in the empirical work in Chapter 14, that shrinkage factor simplifies to $[1 + \lambda_{Ridge}/(n - 1)]^{-1}$.

To show these results, note that if the regressors are uncorrelated, then $X'X$ is diagonal, so that $X'X + \lambda_{Ridge}I_k$ is diagonal with $j^{th}$ diagonal element $\sum_{i=1}^n X_{ji}^2 + \lambda_{Ridge}$. Then Equation (19.99) simplifies, so that the ridge estimator of the $j^{th}$ coefficient $\beta_j$ is $\hat\beta_j^{Ridge} = (\sum_{i=1}^n X_{ji}^2 + \lambda_{Ridge})^{-1}\sum_{i=1}^n X_{ji}Y_i = (1 + \lambda_{Ridge}/\sum_{i=1}^n X_{ji}^2)^{-1}(\sum_{i=1}^n X_{ji}^2)^{-1}\sum_{i=1}^n X_{ji}Y_i = (1 + \lambda_{Ridge}/\sum_{i=1}^n X_{ji}^2)^{-1}\hat\beta_j$, where $\hat\beta_j$ is the OLS estimator for these uncorrelated regressors. Thus, with uncorrelated regressors, the ridge regression estimator shrinks the OLS estimator toward 0 by the factor $(1 + \lambda_{Ridge}/\sum_{i=1}^n X_{ji}^2)^{-1}$. If in addition the regressors are standardized using the sample standard deviation, then $\sum_{i=1}^n X_{ji}^2 = n - 1$, in which case $\hat\beta_j^{Ridge} = [1 + \lambda_{Ridge}/(n - 1)]^{-1}\hat\beta_j$.
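The shrinkage result for uncorrelated, standardized regressors can be verified on constructed data. The sketch below is an illustration under stated assumptions: the regressors are made exactly uncorrelated in the sample by a QR decomposition of demeaned random data and are then rescaled to unit sample variance; the data and the value of `lam` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, lam = 50, 3, 10.0

# Build regressors that are mean zero, mutually uncorrelated in the sample,
# and have sample variance 1, so that X'X = (n - 1) * I_k.
raw = rng.normal(size=(n, k))
raw -= raw.mean(axis=0)
Q, _ = np.linalg.qr(raw)          # orthonormal columns spanning the same space
X = Q * np.sqrt(n - 1)

Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)

# Shrinkage factor [1 + lam / (n - 1)]^{-1} from the text.
shrinkage = 1.0 / (1.0 + lam / (n - 1))
print(beta_ridge, shrinkage * beta_ols)
```

Every coefficient is shrunk by the same factor here only because all regressors have the same sum of squares; with uncorrelated but unstandardized regressors the factor would differ across $j$.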

Second, as is discussed in Section 14.4, predictions made using the ridge regression estima- 
tor, in general, change if different linear combinations of the regressors are used as predictors. 
Specifically, if X denotes the matrix of predictors, then the ridge predictions made using X and 
using XA differ, where A is a nonsingular k X k matrix. This is an important difference 
between ridge and OLS because OLS yields the same predictions whether X or XA is used. 

To show this result, consider the ridge regression estimator computed using $XA$, and denote that estimator by $\hat\beta_A^{Ridge}$. In this notation, the ridge regression estimator computed using $X$ without the linear transformation is $\hat\beta_I^{Ridge}$. The same linear transformation must be applied to the out-of-sample and in-sample predictors, so the transformed out-of-sample observation is $A'X^{oos}$. Thus the out-of-sample predicted value using $\hat\beta_A^{Ridge}$ is $\hat{Y}_A^{oos} = (A'X^{oos})'\hat\beta_A^{Ridge} = X^{oos\prime}A\hat\beta_A^{Ridge}$. In this notation, the out-of-sample predicted value using the original regressors $X$ is $\hat{Y}_I^{oos} = X^{oos\prime}\hat\beta_I^{Ridge}$. From Equation (19.99), the ridge estimator is $\hat\beta_A^{Ridge} = [(XA)'(XA) + \lambda_{Ridge}I_k]^{-1}(XA)'Y = (A'X'XA + \lambda_{Ridge}I_k)^{-1}A'X'Y = [A'(X'X + \lambda_{Ridge}A'^{-1}A^{-1})A]^{-1}A'X'Y = A^{-1}[X'X + \lambda_{Ridge}(AA')^{-1}]^{-1}X'Y$, where the equalities follow by collecting terms using the properties of matrix inverses. Thus the ridge prediction for the out-of-sample observation is $\hat{Y}_A^{oos} = X^{oos\prime}A\hat\beta_A^{Ridge} = X^{oos\prime}[X'X + \lambda_{Ridge}(AA')^{-1}]^{-1}X'Y$, whereas using the X's without the linear transformation yields the prediction $\hat{Y}_I^{oos} = X^{oos\prime}(X'X + \lambda_{Ridge}I_k)^{-1}X'Y$. The two predictions differ because the matrix $(AA')^{-1}$ appears in the expression for $\hat{Y}_A^{oos}$ but not in the expression for $\hat{Y}_I^{oos}$. The only time that a linear transformation $A$ does not change the ridge predicted value is when the linear transformation is orthonormal, that is, when $AA' = I_k$, so that $(AA')^{-1} = I_k$.

To see that OLS produces the same predicted value, regardless of the linear transformation $A$ (as long as $A$ is nonsingular), note that the OLS predicted value is the ridge predicted value when $\lambda_{Ridge} = 0$. The result follows from substituting $\lambda_{Ridge} = 0$ into the expressions for the ridge predictions $\hat{Y}_A^{oos}$ and $\hat{Y}_I^{oos}$ in the previous paragraph.
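These invariance properties are easy to check numerically. The sketch below uses simulated data and an arbitrary nonsingular matrix `A` (all of which are assumptions made for illustration, as is the helper name `ridge`). It compares the ridge prediction from the transformed regressors with the closed-form expression derived above, and then repeats the comparison for OLS and for an orthonormal transformation.

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator of Equation (19.99); lam = 0 gives OLS."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)

rng = np.random.default_rng(3)
n, k, lam = 60, 3, 5.0
X = rng.normal(size=(n, k))
Y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)
x_oos = np.array([0.3, -1.2, 0.8])          # hypothetical out-of-sample X

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 3.0]])             # nonsingular, not orthonormal

pred_I = x_oos @ ridge(X, Y, lam)                   # original regressors
pred_A = (A.T @ x_oos) @ ridge(X @ A, Y, lam)       # transformed regressors
pred_A_formula = x_oos @ np.linalg.solve(
    X.T @ X + lam * np.linalg.inv(A @ A.T), X.T @ Y)

ols_I = x_oos @ ridge(X, Y, 0.0)
ols_A = (A.T @ x_oos) @ ridge(X @ A, Y, 0.0)

Q, _ = np.linalg.qr(rng.normal(size=(k, k)))        # orthonormal: Q Q' = I_k
pred_Q = (Q.T @ x_oos) @ ridge(X @ Q, Y, lam)

print(pred_I, pred_A, pred_A_formula, ols_I, ols_A, pred_Q)
```

In this example `pred_A` matches the closed-form expression but differs from `pred_I`, while the OLS predictions and the prediction under the orthonormal transformation are unchanged.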


Principal Components Analysis 


This section presents formulas for the principal components of X and shows that the sum of 
the variances of the principal components equals the sum of the variances of the X’s [Equation 
(14.10)]. The section concludes with an expression for the out-of-sample prediction, computed 
using the first r principal components, as in Section 14.5, expressed in terms of the out-of-sample values of the predictors, $X^{oos}$.

In Key Concept 14.2, the $j^{th}$ principal component of $X$ is defined to be the linear combination of $X$ such that (a) the squared weights of the linear combinations sum to 1; (b) the $j^{th}$ principal component is uncorrelated with the previous $j - 1$ principal components; and (c) the $j^{th}$ principal component maximizes the variance of the linear combination, subject to (a) and (b). We now state these criteria mathematically and use them to derive explicit formulas for the principal components. In particular, we show that the linear combination weights used to form the first $r$ principal components are the eigenvectors of $X'X$ corresponding to its $r$ largest eigenvalues.

Let $PC_j$ denote the $j^{th}$ principal component, and let $W_j$ denote the $k \times 1$ vector of weights used to construct $PC_j$, so that $PC_j = XW_j$. The sum of squares of $PC_j$ is $PC_j'PC_j = W_j'X'XW_j$, and the sum of squared weights is $W_j'W_j$. Because $X$ has mean 0 (the X's are standardized), $PC_j'PC_j/(n - 1)$ is the sample variance of the $j^{th}$ principal component. The weights $W_j$ are chosen to solve

$$\max_{W_j} PC_j'PC_j = W_j'X'XW_j \ \text{ subject to } \ W_j'W_j = 1 \ \text{ and } \ PC_j'PC_i = 0 \ \text{ for } i < j. \qquad (19.100)$$


For $j = 1$, the constrained maximization problem is to choose $W_1$ to maximize $W_1'X'XW_1$ subject to $W_1'W_1 = 1$. This constrained maximization is done by maximizing the Lagrangian, $W_1'X'XW_1 - \lambda_1(W_1'W_1 - 1)$, where $\lambda_1$ is the Lagrange multiplier. Taking the derivative of the Lagrangian with respect to $W_1$ and setting it to 0 yields

$$X'XW_1 = \lambda_1W_1. \qquad (19.101)$$

Equation (19.101) shows that $W_1$ is an eigenvector of $X'X$ and $\lambda_1$ is its corresponding eigenvalue, where the eigenvector is normalized to have unit length. Moreover, multiplying both sides of Equation (19.101) by $W_1'$ shows that $W_1'X'XW_1 = PC_1'PC_1 = \lambda_1$, so that maximizing $PC_1'PC_1$ requires that $\lambda_1$ be the largest eigenvalue of $X'X$ and that $W_1$ be the eigenvector of $X'X$ corresponding to the largest eigenvalue.

Now consider $W_2$. There are two constraints, $W_2'W_2 = 1$ and $PC_2'PC_1 = W_2'X'XW_1 = 0$, so the Lagrangian is $W_2'X'XW_2 - \lambda_2(W_2'W_2 - 1) - \gamma_{21}W_2'X'XW_1$, where $\lambda_2$ and $\gamma_{21}$ are Lagrange multipliers. Taking the derivative of the Lagrangian with respect to $W_2$ and setting it to 0 yields

$$X'XW_2 = \lambda_2W_2 + \tfrac{1}{2}\gamma_{21}X'XW_1. \qquad (19.102)$$

First note that multiplying both sides of Equation (19.101) by $W_2'$ yields $W_2'X'XW_1 = \lambda_1W_2'W_1$; because $W_2'X'XW_1 = 0$, it follows that $W_2'W_1 = 0$. Now multiplying both sides of Equation (19.102) by $W_1'$ yields $W_1'X'XW_2 = \lambda_2W_1'W_2 + \tfrac{1}{2}\gamma_{21}W_1'X'XW_1 = \tfrac{1}{2}\gamma_{21}W_1'X'XW_1$, but because $W_1'X'XW_2 = W_1'W_2 = 0$, it must be that $\gamma_{21} = 0$. Thus Equation (19.102) reduces to $X'XW_2 = \lambda_2W_2$, so that $W_2$ is an eigenvector of $X'X$ and $\lambda_2$ is its corresponding eigenvalue. Multiplying both sides of $X'XW_2 = \lambda_2W_2$ by $W_2'$ and imposing the unit normalization yields $W_2'X'XW_2 = \lambda_2$. Thus, the Lagrangian is maximized by choosing $W_2$ to be the eigenvector corresponding to the largest of the remaining eigenvalues, that is, to the second-largest eigenvalue of $X'X$.

Continuing, these calculations show that $W_j$ is the unit-length eigenvector of $X'X$ associated with $\lambda_j$, the $j^{th}$-largest eigenvalue of $X'X$; that $PC_j'PC_j = \lambda_j$; and that $PC_j'PC_i = 0$ for $i \ne j$. If $n < k$, at most the first $n$ eigenvalues of $X'X$ are nonzero, so the total number of principal components is $\min(n, k)$.

Because the trace of a matrix is equal to the sum of its eigenvalues,

$$\mathrm{trace}(X'X) = \sum_{j=1}^{\min(n,k)} \lambda_j = \sum_{j=1}^{\min(n,k)} PC_j'PC_j. \qquad (19.103)$$

Dividing the first and last expressions in Equation (19.103) by $n - 1$ yields Equation (14.10).
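The eigenvector characterization and Equation (19.103) can be confirmed with a standard symmetric eigendecomposition. The following sketch assumes simulated, standardized regressors (the data-generating step is purely illustrative). It sorts the eigenvalues of $X'X$ in decreasing order, forms the principal components, and checks that they are mutually orthogonal with sums of squares equal to the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 4

# Illustrative correlated regressors, standardized with the sample std. dev.
raw = rng.normal(size=(n, k)) @ rng.normal(size=(k, k))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# eigh returns eigenvalues in ascending order; reorder to descending.
eigval, eigvec = np.linalg.eigh(X.T @ X)
order = np.argsort(eigval)[::-1]
lam, W = eigval[order], eigvec[:, order]    # column j of W is W_j

PC = X @ W                                  # column j is PC_j = X W_j
gram = PC.T @ PC                            # should equal diag(lam)

total_variance = np.trace(X.T @ X) / (n - 1)   # sum of the variances of the X's
pc_variances = lam / (n - 1)                   # variances of the PCs
print(pc_variances, pc_variances.sum(), total_variance)
```

Because the X's are standardized, each has sample variance 1, so both sums equal $k$ in this example.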

Finally, we provide an expression for the out-of-sample prediction in terms of the out-of-sample value of the predictors, $X^{oos}$. The first $r$ out-of-sample values of the principal components are $PC_{1:r}^{oos} = [PC_1^{oos} \ PC_2^{oos} \ \cdots \ PC_r^{oos}]' = W_{1:r}'X^{oos}$, where $W_{1:r} = [W_1 \ W_2 \ \cdots \ W_r]$ are the first $r$ eigenvectors of $X'X$ in the estimation sample. Let $\hat\gamma$ denote the $r \times 1$ vector of OLS coefficients in the regression of $Y$ on the first $r$ principal components in the estimation sample. Then the principal components prediction of $Y^{oos}$ is $\hat{Y}^{oos} = \hat\gamma'PC_{1:r}^{oos}$. Written in terms of the original regressors, the principal components prediction is

$$\hat{Y}^{oos} = \hat\gamma'W_{1:r}'X^{oos}. \qquad (19.104)$$

This expression was used to compute the entries in Table 14.4 for the principal components prediction.
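A compact way to read Equation (19.104) is as a two-step computation: regress $Y$ on the first $r$ principal components, then apply the same weights to the out-of-sample predictors. The sketch below is a hypothetical illustration (the function name `pc_predict` and the simulated data are assumptions, and the regressors are left unstandardized for brevity). As a sanity check, it uses the fact that with $r = k$ the principal components span the same space as $X$, so the prediction coincides with the OLS prediction.

```python
import numpy as np

def pc_predict(X, Y, x_oos, r):
    """Principal components prediction of Equation (19.104)."""
    eigval, eigvec = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigval)[::-1][:r]
    W = eigvec[:, order]                              # k x r matrix W_{1:r}
    PC = X @ W                                        # in-sample components
    gamma_hat = np.linalg.lstsq(PC, Y, rcond=None)[0] # OLS of Y on the PCs
    return gamma_hat @ (W.T @ x_oos)                  # gamma' W' x_oos

rng = np.random.default_rng(5)
n, k = 80, 4
X = rng.normal(size=(n, k))
Y = X @ np.array([1.0, 0.0, -0.5, 2.0]) + rng.normal(size=n)
x_oos = rng.normal(size=k)                            # hypothetical new observation

pred_r2 = pc_predict(X, Y, x_oos, r=2)
pred_all = pc_predict(X, Y, x_oos, r=k)
pred_ols = x_oos @ np.linalg.lstsq(X, Y, rcond=None)[0]
print(pred_r2, pred_all, pred_ols)
```

Choosing $r < k$ discards the directions of $X$ with the smallest variance, which is what distinguishes the principal components prediction from OLS.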


Appendix 


TABLE 1 The Cumulative Standard Normal Distribution Function, Φ(z) = Pr(Z ≤ z)

[Figure: standard normal density; the shaded area to the left of z is Pr(Z ≤ z).]

Second Decimal Value of z
z 0 1 2 3 4 5 6 7 8 9
−2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
−2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
−2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
−2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
−2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
−2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
−2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
−2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
−2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
−2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
−1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
−1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
−1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
−1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
−1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
−1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
−1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
−1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
−1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
−1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
−0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611


(Table 1 continued)

Second Decimal Value of z
z 0 1 2 3 4 5 6 7 8 9
−0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
−0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
−0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
−0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
−0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
−0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
−0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
−0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
−0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986

This table can be used to calculate Pr(Z ≤ z), where Z is a standard normal variable. For example, when z = 1.17, this probability is 0.8790, which is the table entry for the row labeled 1.1 and the column labeled 7.
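Entries of the cumulative standard normal distribution can also be computed directly rather than looked up. The sketch below uses the identity $\Phi(z) = \tfrac{1}{2}[1 + \mathrm{erf}(z/\sqrt{2})]$, which is a standard result and not something stated in the text; the function name `std_normal_cdf` is a hypothetical choice.

```python
import math

def std_normal_cdf(z):
    """Phi(z) = Pr(Z <= z) for a standard normal Z, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Values corresponding to tabulated entries: z = 1.17 and z = -2.90.
p_117 = std_normal_cdf(1.17)
p_m290 = std_normal_cdf(-2.90)
print(round(p_117, 4), round(p_m290, 4))
```

Rounded to four decimals, these values agree with the tabulated entries for z = 1.17 and z = −2.90.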
TABLE 2 Critical Values for Two-Sided and One-Sided Tests Using the Student t Distribution


Significance Level
Degrees of    20% (2-Sided)   10% (2-Sided)   5% (2-Sided)     2% (2-Sided)   1% (2-Sided)
Freedom       10% (1-Sided)   5% (1-Sided)    2.5% (1-Sided)   1% (1-Sided)   0.5% (1-Sided)
1 3.08 6.31 12.71 31.82 63.66
2 1.89 2.92 4.30 6.96 9.92
3 1.64 2.35 3.18 4.54 5.84
4 1.53 2.13 2.78 3.75 4.60
5 1.48 2.02 2.57 3.36 4.03
6 1.44 1.94 2.45 3.14 3.71
7 1.41 1.89 2.36 3.00 3.50
8 1.40 1.86 2.31 2.90 3.36
9 1.38 1.83 2.26 2.82 3.25
10 1.37 1.81 2.23 2.76 3.17
11 1.36 1.80 2.20 2.72 3.11
12 1.36 1.78 2.18 2.68 3.05
13 1.35 1.77 2.16 2.65 3.01
14 1.35 1.76 2.14 2.62 2.98
15 1.34 1.75 2.13 2.60 2.95
16 1.34 1.75 2.12 2.58 2.92
17 1.33 1.74 2.11 2.57 2.90
18 1.33 1.73 2.10 2.55 2.88
19 1.33 1.73 2.09 2.54 2.86
20 1.33 1.72 2.09 2.53 2.85
21 1.32 1.72 2.08 2.52 2.83
22 1.32 1.72 2.07 2.51 2.82
23 1.32 1.71 2.07 2.50 2.81
24 1.32 1.71 2.06 2.49 2.80
25 1.32 1.71 2.06 2.49 2.79
26 1.32 1.71 2.06 2.48 2.78
27 1.31 1.70 2.05 2.47 2.77
28 1.31 1.70 2.05 2.47 2.76
29 1.31 1.70 2.05 2.46 2.76
30 1.31 1.70 2.04 2.46 2.75
60 1.30 1.67 2.00 2.39 2.66
90 1.29 1.66 1.99 2.37 2.63
120 1.29 1.66 1.98 2.36 2.62
∞ 1.28 1.64 1.96 2.33 2.58
Values are shown for the critical values for two-sided (≠) and one-sided (>) alternative hypotheses. The critical value for the one-sided (<) test is the negative of the one-sided (>) critical value shown in the table. For example, 2.13 is the critical value for a two-sided test with a significance level of 5% using the Student t distribution with 15 degrees of freedom.

TABLE 3 Critical Values for the $\chi^2$ Distribution


Significance Level
Degrees of Freedom 10% 5% 1%
1 2.71 3.84 6.63
2 4.61 5.99 9.21
3 6.25 7.81 11.34
4 7.78 9.49 13.28
5 9.24 11.07 15.09
6 10.64 12.59 16.81
7 12.02 14.07 18.48
8 13.36 15.51 20.09
9 14.68 16.92 21.67
10 15.99 18.31 23.21
11 17.28 19.68 24.72
12 18.55 21.03 26.22
13 19.81 22.36 27.69
14 21.06 23.68 29.14
15 22.31 25.00 30.58
16 23.54 26.30 32.00
17 24.77 27.59 33.41
18 25.99 28.87 34.81
19 27.20 30.14 36.19
20 28.41 31.41 37.57
21 29.62 32.67 38.93
22 30.81 33.92 40.29
23 32.01 35.17 41.64
24 33.20 36.41 42.98
25 34.38 37.65 44.31
26 35.56 38.89 45.64
27 36.74 40.11 46.96
28 37.92 41.34 48.28
29 39.09 42.56 49.59
30 40.26 43.77 50.89

This table contains the 90th, 95th, and 99th percentiles of the $\chi^2$ distribution. These serve as critical values for tests with significance levels of 10%, 5%, and 1%.
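The first two rows of the chi-squared table can be reproduced from closed-form results, which is a useful check on how the critical values are defined. The sketch below relies on two standard facts that are not stated in the text: a $\chi^2_1$ variable is the square of a standard normal variable, and a $\chi^2_2$ variable has cumulative distribution function $1 - e^{-x/2}$. The function names are hypothetical, and `statistics.NormalDist` requires Python 3.8 or later.

```python
import math
from statistics import NormalDist

def chi2_crit_df1(alpha):
    """Critical value of chi-squared(1): square of the two-sided normal critical value."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z

def chi2_crit_df2(alpha):
    """Critical value of chi-squared(2): solves 1 - exp(-x / 2) = 1 - alpha."""
    return -2.0 * math.log(alpha)

for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(chi2_crit_df1(alpha), 2), round(chi2_crit_df2(alpha), 2))
```

Dividing a $\chi^2_m$ critical value by $m$ gives the corresponding $F_{m,\infty}$ critical value; for example, 5.99/2 rounds to 3.00 for $m = 2$ at the 5% level.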


TABLE 4 Critical Values for the $F_{m,\infty}$ Distribution

[Figure: density of the $F_{m,\infty}$ distribution; the shaded area to the right of the critical value equals the significance level.]

Significance Level
Degrees of Freedom 10% 5% 1%
1 2.71 3.84 6.63
2 2.30 3.00 4.61
3 2.08 2.60 3.78
4 1.94 2.37 3.32
5 1.85 2.21 3.02
6 1.77 2.10 2.80
7 1.72 2.01 2.64
8 1.67 1.94 2.51
9 1.63 1.88 2.41
10 1.60 1.83 2.32
11 1.57 1.79 2.25
12 1.55 1.75 2.18
13 1.52 1.72 2.13
14 1.50 1.69 2.08
15 1.49 1.67 2.04
16 1.47 1.64 2.00
17 1.46 1.62 1.97
18 1.44 1.60 1.93
19 1.43 1.59 1.90
20 1.42 1.57 1.88
21 1.41 1.56 1.85
22 1.40 1.54 1.83
23 1.39 1.53 1.81
24 1.38 1.52 1.79
25 1.38 1.51 1.77
26 1.37 1.50 1.76
27 1.36 1.49 1.74
28 1.35 1.48 1.72
29 1.35 1.47 1.71
30 1.34 1.46 1.70

This table contains the 90th, 95th, and 99th percentiles of the $F_{m,\infty}$ distribution. These serve as critical values for tests with significance levels of 10%, 5%, and 1%.

TABLE 5A Critical Values for the $F_{n_1,n_2}$ Distribution: 10% Significance Level
Numerator Degrees of Freedom ($n_1$)
Denominator Degrees of Freedom ($n_2$) 1 2 3 4 5 6 7 8 9 10
1 39.86 49.50 53.59 55.83 57.24 58.20 58.90 59.44 59.86 60.20 
2 8.53 9.00 9.16 9.24 9.29 9.33 9.35 9.37 9.38 9.39 
3 5.54 5.46 5.39 5.34 5.31 5.28 5.27 5.25 5.24 5.23
4 4.54 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.94 3.92 
5 4.06 3.78 3.62 3.52 3.45 3.40 3.37 3.34 3.32 3.30 
6 3.78 3.46 3.29 3.18 3.11 3.05 3.01 2.98 2.96 2.94 
7 3.59 3.26 3.07 2.96 2.88 2.83 2.78 2.75 2.72 2.70 
8 3.46 3.11 2.92 2.81 2.73 2.67 2.62 2.59 2.56 2.54 
9 3.36 3.01 2.81 2.69 2.61 2.55 2.51 2.47 2.44 2.42 
10 3.29 2.92 2.73 2.61 2.52 2.46 2.41 2.38 2.35 2.32
11 3.23 2.86 2.66 2.54 2.45 2.39 2.34 2.30 2.27 2.25 
12 3.18 2.81 2.61 2.48 2.39 2.33 2.28 2.24 2.21 2.19 
13 3.14 2.76 2.56 2.43 2.35 2.28 2.23 2.20 2.16 2.14 
14 3.10 2.73 2.52 2.39 2.31 2.24 2.19 2.15 2.12 2.10
15 3.07 2.70 2.49 2.36 2.27 2.21 2.16 2.12 2.09 2.06
16 3.05 2.67 2.46 2.33 2.24 2.18 2.13 2.09 2.06 2.03 
17 3.03 2.64 2.44 2.31 2.22 2.15 2.10 2.06 2.03 2.00 
18 3.01 2.62 2.42 2.29 2.20 2.13 2.08 2.04 2.00 1.98 
19 2.99 2.61 2.40 2.27 2.18 2.11 2.06 2.02 1.98 1.96
20 2.97 2.59 2.38 2.25 2.16 2.09 2.04 2.00 1.96 1.94 
21 2.96 2.57 2.36 2.23 2.14 2.08 2.02 1.98 1.95 1.92
22 2.95 2.56 2.35 2.22 2.13 2.06 2.01 1.97 1.93 1.90 
23 2.94 2.55 2.34 2.21 2.11 2.05 1.99 1.95 1.92 1.89
24 2.93 2.54 2.33 2.19 2.10 2.04 1.98 1.94 1.91 1.88 
25 2.92 2.53 2.32 2.18 2.09 2.02 1.97 1.93 1.89 1.87 
26 2.91 2.52 2.31 2.17 2.08 2.01 1.96 1.92 1.88 1.86 
27 2.90 2.51 2.30 2.17 2.07 2.00 1.95 1.91 1.87 1.85 
28 2.89 2.50 2.29 2.16 2.06 2.00 1.94 1.90 1.87 1.84 
29 2.89 2.50 2.28 2.15 2.06 1.99 1.93 1.89 1.86 1.83 
30 2.88 2.49 2.28 2.14 2.05 1.98 1.93 1.88 1.85 1.82 
60 2.79 2.39 2.18 2.04 1.95 1.87 1.82 1.77 1.74 1.71 
90 2.76 2.36 2.15 2.01 1.91 1.84 1.78 1.74 1.70 1.67 
120 2.75 2.35 2.13 1.99 1.90 1.82 1.77 1.72 1.68 1.65 
∞ 2.71 2.30 2.08 1.94 1.85 1.77 1.72 1.67 1.63 1.60
This table contains the 90th percentile of the $F_{n_1,n_2}$ distribution, which serves as the critical values for a test with a 10% significance level.

TABLE 5B Critical Values for the $F_{n_1,n_2}$ Distribution: 5% Significance Level


Numerator Degrees of Freedom ($n_1$)
Denominator Degrees of Freedom ($n_2$) 1 2 3 4 5 6 7 8 9 10
1 161.40 199.50 215.70 224.60 230.20 234.00 236.80 238.90 240.50 241.90
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.39 19.40
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99
90 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99 1.94
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83
This table contains the 95th percentile of the $F_{n_1,n_2}$ distribution, which serves as the critical values for a test with a 5% significance level.

TABLE 5C Critical Values for the $F_{n_1,n_2}$ Distribution: 1% Significance Level
Numerator Degrees of Freedom ($n_1$)
Denominator Degrees of Freedom ($n_2$) 1 2 3 4 5 6 7 8 9 10
1 4052.00 4999.00 5403.00 5624.00 5763.00 5859.00 5928.00 5981.00 6022.00 6055.00 
2 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 99.40 
3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 
4 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 14.55 
5 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05
6 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 
7 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62
8 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81
9 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26
10 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 
17 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59 
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 
19 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 
25 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13
26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09
27 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06 
28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 
29 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00 
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63
90 6.93 4.85 4.01 3.53 3.23 3.01 2.84 2.72 2.61 2.52 
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 
∞ 6.63 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32
This table contains the 99th percentile of the $F_{n_1,n_2}$ distribution, which serves as the critical values for a test with a 1% significance level.

Acceptance region: The set of values of a test statistic
for which the null hypothesis is accepted (is not
rejected).

ADF: See augmented Dickey-Fuller (ADF) test.

Adjusted $R^2$ ($\bar{R}^2$): A modified version of $R^2$ that does
not necessarily increase when a new regressor is
added to the regression.

ADL(p, q): See autoregressive distributed lag (ADL) 
model. 


AIC: See information criterion. 

Akaike information criterion (AIC): See information 
criterion. 

Alternative hypothesis: The hypothesis that is
assumed to be true if the null hypothesis is false.
The alternative hypothesis is often denoted $H_1$.


ARCH: See autoregressive conditional heteroskedas- 
ticity (ARCH). 

AR(p): See autoregression. 

Asymptotic distribution: The approximate sampling 
distribution of a random variable computed using 
a large sample. For example, the asymptotic distri- 
bution of the sample average is normal. 


Asymptotic normal distribution: A normal distribu- 
tion that approximates the sampling distribution 
of a statistic computed using a large sample. 

Attrition: The loss of subjects from a study after 
assignment to the treatment or the control group. 

Augmented Dickey-Fuller (ADF) statistic: A
regression-based statistic used to test for a unit
root in an AR(p) model.

Autocorrelation: The correlation between a time
series variable and its lagged value. The $j^{th}$
autocorrelation of Y is the correlation between $Y_t$
and $Y_{t-j}$.

Autocovariance: The covariance between a time
series variable and its lagged value. The $j^{th}$
autocovariance of Y is the covariance between $Y_t$
and $Y_{t-j}$.

Autoregression: A linear regression model that 
relates a time series variable to its past (that is, 
lagged) values. An autoregression with p lagged 
values as regressors is denoted AR(p). 

Autoregressive conditional heteroskedasticity 
(ARCH): A time series model of conditional 
heteroskedasticity. 

Autoregressive distributed lag (ADL) model: A
linear regression model in which the time series
variable $Y_t$ is expressed as a function of lags of $Y_t$
and of another variable, $X_t$. The model is denoted
ADL(p, q), where p denotes the number of lags of
$Y_t$ and q denotes the number of lags of $X_t$.


Average causal effect: The population average of the 
individual causal effects in a heterogeneous popu- 
lation. Also called the average treatment effect. 

Average treatment effect: See average causal effect. 


Balanced panel: A panel data set with no missing 
observations; that is, the variables are observed for 
each entity and each time period. 

Base specification: A baseline or benchmark regres- 
sion specification that includes a set of regressors 
chosen using a combination of expert judgment, 
economic theory, and knowledge of how the data 
were collected. 

Bayes information criterion (BIC): See information 
criterion. 

Bayes rule: The conditional probability of Y given X
is the conditional probability of X given Y times
the relative marginal probabilities of Y and X:
$\Pr(Y = y \mid X = x) = \Pr(X = x \mid Y = y)\Pr(Y = y)/\Pr(X = x)$.
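
A small numerical check may make the rule concrete. The sketch below assumes a made-up joint distribution for two binary variables and verifies that $\Pr(Y=1 \mid X=1)$ equals $\Pr(X=1 \mid Y=1)\Pr(Y=1)/\Pr(X=1)$; all names and probabilities are hypothetical.

```python
# Hypothetical joint probabilities, keyed by (x, y).
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}


def pr_x(x):
    """Marginal probability that X = x."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def pr_y(y):
    """Marginal probability that Y = y."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def pr_y_given_x(y, x):
    return joint[(x, y)] / pr_x(x)


def pr_x_given_y(x, y):
    return joint[(x, y)] / pr_y(y)


direct = pr_y_given_x(1, 1)
via_bayes = pr_x_given_y(1, 1) * pr_y(1) / pr_x(1)
```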


Bernoulli distribution: The probability distribution of 
a Bernoulli random variable. 

Bernoulli random variable: A random variable that 
takes on one of two values, 0 and 1. Also known as 
a binary random variable. 

Best Linear Unbiased Estimator (BLUE): An 
estimator that has the smallest variance of any 
estimator that is a linear function of the sample 
values Y and is unbiased. Under the Gauss- 
Markov conditions, the ordinary least squares 
estimator is the Best Linear Unbiased Estimator 
of the regression coefficients conditional on the 
values of the regressors. 

Bias: The expected value of the difference between 
an estimator and the parameter that it is 
estimating. If $\hat{\mu}_Y$ is an estimator of $\mu_Y$, then the
bias of $\hat{\mu}_Y$ is $E(\hat{\mu}_Y) - \mu_Y$.

BIC: See information criterion. 

Binary variable: A variable that is either 0 or 1. A 
binary variable is used to indicate a binary out- 
come. For example, X is a binary (or indicator, or 
dummy) variable for a person’s sex if X = 1 if the 
person is female and X = 0 if the person is male. 

Bivariate normal distribution: A generalization 
of the normal distribution to describe the joint 
distribution of two random variables. 


Bivariate normal p.d.f.: See bivariate normal 
distribution. 

BLUE: See Best Linear Unbiased Estimator 
(BLUE). 

Bonferroni test: A way to test a joint hypothesis
by testing the component individual hypotheses 
one at a time, using an adjusted critical value that 
accounts for the multiple hypotheses being tested. 
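
One way to sketch the adjusted critical value, assuming the individual t-statistics are approximately standard normal in large samples: with $q$ hypotheses and overall significance level $\alpha$, choose $c$ so that $\Pr(|Z| > c) = \alpha/q$ and reject the joint null if any $|t|$ exceeds $c$. The function names below are hypothetical.

```python
from statistics import NormalDist


def bonferroni_critical_value(alpha: float, q: int) -> float:
    """Two-sided critical value c with Pr(|Z| > c) = alpha / q for standard normal Z."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * q))


def bonferroni_reject(t_stats, alpha=0.05):
    """Reject the joint null if any individual |t| exceeds the adjusted critical value."""
    c = bonferroni_critical_value(alpha, len(t_stats))
    return any(abs(t) > c for t in t_stats)


c_one = bonferroni_critical_value(0.05, 1)  # ordinary two-sided 5% value
c_two = bonferroni_critical_value(0.05, 2)  # larger, to account for two tests
```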

Break date: The date of a discrete change in popula- 
tion time series regression coefficient(s). 

Causal effect: The expected effect of a given inter- 
vention or treatment on an outcome as measured 
in an ideal randomized controlled experiment. 

Causal inference: Tests, confidence intervals, and/or 
estimation of a causal effect. 

c.d.f.: Cumulative distribution function. See cumulative
probability distribution.

Central limit theorem: In mathematical statistics, 
under general conditions, the sampling distribution 
of the standardized sample average is well approx- 
imated by a standard normal distribution when the 
sample size is large. 

Chi-squared distribution: The distribution of the 
sum of m squared independent standard normal 
random variables. The parameter $m$ is called
the degrees of freedom of the chi-squared
distribution.

Chow test: A test for a break in a time series regres- 
sion at a known break date. 


Classical measurement error model: The observed 
value of a random variable equals its true, unob- 
served value plus independent measurement error. 


Clustered standard errors: A method of computing 
standard errors that is appropriate for panel data. 


Coefficient of determination: See $R^2$.

Cointegration: When two or more time series vari- 
ables share a common stochastic trend. 

Common component: In a dynamic factor model, the 
part of a time series variable that is explained by 
the common unobserved factors. 

Common trend: A trend shared by two or more time 
series. 

Conditional distribution: The probability distribution 
of one random variable given that another random 
variable takes on a particular value. 


Conditional expectation: The expected value of one 
random variable given that another random vari- 
able takes on a particular value. 


Conditional heteroskedasticity: The variance, usually 
of an error term, depends on other variables. 

Conditional mean: The mean of a conditional distri- 
bution. See conditional expectation. 

Conditional mean independence: The conditional 
expectation of the regression error $u_i$ given the
regressors depends on some but not all of the 
regressors. 

Conditional variance: The variance of a conditional 
distribution. 


Confidence interval (confidence set): An interval (or
set) constructed from sample data that contains
the true value of a population parameter with
a prespecified probability when computed over
repeated samples.


Confidence level: The prespecified probability that a 
confidence interval (or set) contains the true value 
of the parameter. 


Consistency: The property that an estimator is con- 
sistent. See consistent estimator. 

Consistent estimator: An estimator that converges in 
probability to the parameter that it is estimating. 


Constant regressor: The regressor associated with the 
regression intercept; this regressor is always equal 
to 1. 

Constant term: The regression intercept. 


Continuous mapping theorem: If a random variable
$S_n$ converges in distribution to $S$, then a continuous
function of that random variable, $g(S_n)$, converges
in distribution to $g(S)$.

Continuous random variable: A random variable that 
takes on a continuum of values. 

Control group: The group that does not receive the 
treatment or intervention in an experiment. 

Control variable: A regressor that controls for an omit- 
ted factor that determines the dependent variable. 


Converge in probability: When a sequence of random 
variables converges to a specific value; for exam- 
ple, when the sample average becomes close to the 
population mean as the sample size increases; see 
Key Concept 2.6 and Section 18.2. 

Convergence in distribution: When a sequence of dis- 
tributions converges to a limit; a precise definition 
is given in Section 18.2. 


Correlation: A unit-free measure of the extent 
to which two random variables move, or vary, 
together. The correlation (or correlation coefficient)
between X and Y is $\sigma_{XY}/(\sigma_X \sigma_Y)$ and is
denoted corr(X, Y).

Correlation coefficient: See correlation. 

Covariance: A measure of the extent to which 
two random variables move together. The 
covariance between X and Y is the expected value 
$E[(X - \mu_X)(Y - \mu_Y)]$ and is denoted cov(X, Y)
or $\sigma_{XY}$.
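
The population covariance and correlation have natural sample analogues. The sketch below assumes the usual $n - 1$ divisor and uses hypothetical function names; it also illustrates that the correlation, unlike the covariance, is unit free.

```python
from math import sqrt


def sample_covariance(x, y):
    """Sample covariance of paired observations, with divisor n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    return sample_covariance(x, y) / sqrt(sample_covariance(x, x) * sample_covariance(y, y))


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 4.0]
x_rescaled = [100.0 * xi for xi in x]  # change of units for X

cov_original = sample_covariance(x, y)
cov_rescaled = sample_covariance(x_rescaled, y)       # scales with the units
corr_original = sample_correlation(x, y)
corr_rescaled = sample_correlation(x_rescaled, y)     # unchanged by the units
```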


Covariance matrix: A matrix composed of the 
variances and covariances of a vector of random 
variables. 

Coverage probability: The probability that a con- 
fidence interval contains the true value of the 
coefficient. 

Critical value: The value of a test statistic for which 
the test just rejects the null hypothesis at the given 
significance level. 

Cross-sectional data: Data collected for different 
entities in a single time period. 

Cubic regression model: A nonlinear regression function
that includes $X$, $X^2$, and $X^3$ as regressors.


Cumulative distribution function (c.d.f.): See 
cumulative probability distribution. 

Cumulative dynamic multiplier: The cumulative 
effect of a unit change in the time series vari- 
able X on Y. The h-period cumulative dynamic 
multiplier is the effect of a unit change in $X_t$ on
$Y_t + Y_{t+1} + \cdots + Y_{t+h}$.


Cumulative probability distribution: A function 
showing the probability that a random variable is 
less than or equal to a given number. 


Dependent variable: The variable to be explained 
in a regression or other statistical model; the 
variable appearing on the left-hand side in a 
regression. 


Deterministic trend: A persistent long-term move- 
ment of a variable over time that can be repre- 
sented as a nonrandom function of time. 


DFM: See dynamic factor model (DFM). 


Dickey-Fuller statistic: A regression-based statistic 
used to test for a unit root in a first-order autore- 
gression [AR(1)]. 

Differences estimator: An estimator of the causal 
effect constructed as the difference in the sample 
average outcomes between the treatment and con- 
trol groups. 


Differences-in-differences estimator: The average 
change in Y for those in the treatment group 
minus the average change in Y for those in the 
control group. 
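
The arithmetic can be sketched as follows, assuming outcomes are observed before and after treatment for both groups. The data and the function names are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def differences_in_differences(treat_before, treat_after, control_before, control_after):
    """Average change for the treated minus average change for the controls."""
    change_treated = mean(treat_after) - mean(treat_before)
    change_control = mean(control_after) - mean(control_before)
    return change_treated - change_control


# Hypothetical outcomes: the treated group changes by 5, the control group by 2.
estimate = differences_in_differences(
    treat_before=[10.0, 12.0], treat_after=[15.0, 17.0],
    control_before=[9.0, 11.0], control_after=[11.0, 13.0],
)
```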


Discrete random variable: A random variable that 
takes on discrete values. 


Distributed lag model: A regression model in which 
the regressors are current and lagged values of X. 


Dummy variable: See binary variable. 


Dummy variable trap: A problem caused by includ- 
ing a full set of binary variables in a regression 
together with a constant regressor (intercept), 
leading to perfect multicollinearity. 


Dynamic causal effect: The causal effect of one 
variable on current and future values of another 
variable. 


Dynamic factor model (DFM): A representation 
of N time series variables, where each variable is 
expressed as the sum of a reduced number r of 
common unobserved factors plus an idiosyncratic 
disturbance that is uncorrelated with the factors 
and the idiosyncratic disturbances of the other 
variables. 


Dynamic multiplier: The $h$-period dynamic multiplier
is the effect of a unit change in the time series
variable $X_t$ on $Y_{t+h}$.

Endogenous variable: A variable that is correlated 
with the error term. 


Entity and time fixed effects regression model: A 
panel data regression that includes both entity 
fixed effects and time fixed effects. 

Entity fixed effects: A set of variables that allow
each entity in a panel data regression to have its
own intercept.


Errors-in-variables bias: The bias in an estimator of 
a regression coefficient that arises from measure- 
ment errors in the regressors. 

Error term: The difference between Y and the popu- 
lation regression function, denoted u in this text. 


ESS: See explained sum of squares (ESS). 


Estimate: The numerical value of an estimator com- 
puted using data from a specific sample. 

Estimator: A function of a sample of data to be 
drawn randomly from a population. An estimator 
uses sample data to compute an educated guess of 
the value of a population parameter, such as the 
population mean. 

Exact (finite-sample) distribution: The exact prob- 
ability distribution of a random variable. 

Exact identification: When the number of instrumen- 
tal variables equals the number of endogenous 
regressors. 

Exogenous variable: A variable that is uncorrelated 
with the regression error term. 


Expectation: See expected value. 


Expected value: The long-run average value of a ran- 
dom variable over many repeated trials or occur- 
rences. It is the probability-weighted average of all 
possible values that the random variable can take 
on. The expected value of Y is denoted E(Y) and is 
also called the expectation of Y. 


Experimental data: Data obtained from an 
experiment designed to evaluate a treatment or 
policy or to investigate a causal effect. 


Explained sum of squares (ESS): The sum of squared 
deviations of the predicted values of $Y_i$, $\hat{Y}_i$, from
their average; see Equation (4.14). 


Explanatory variable: See regressor. 


External validity: Inferences and conclusions from a 
statistical study are externally valid if they can be 
generalized from the population and the setting 
studied to other populations and settings. 


Fan chart: A time series plot that displays a forecast
distribution (forecast uncertainty) as a function of 
the forecast horizon. 


Feasible GLS estimator: A version of the general- 
ized least squares (GLS) estimator that uses an 
estimator of the conditional variance of the regres- 
sion errors and covariance between the regression 
errors at different observations. 


Feasible WLS: A version of the weighted least 
squares (WLS) estimator that uses an estimator 
of the conditional variance of the regression errors. 


Final prediction error (FPE): An estimator of the 
mean squared forecast error when the regres- 
sion coefficients are estimated by ordinary least 
squares. 

First difference: The first difference of a time series
variable $Y_t$ is $Y_t - Y_{t-1}$, denoted $\Delta Y_t$.
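
A minimal sketch of the computation, with a hypothetical function name; note that differencing a series of length $T$ leaves $T - 1$ observations.

```python
def first_difference(y):
    """Return the list of changes y[t] - y[t-1] for t = 1, ..., T-1."""
    return [y[t] - y[t - 1] for t in range(1, len(y))]


levels = [3, 5, 4, 8]                 # made-up series
changes = first_difference(levels)    # one observation shorter than levels
```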

First-stage regression: The regression of an included 
endogenous variable on the included exogenous 
variables, if any, and the instrumental variable(s) 
in two stage least squares. 

Fitted value: See predicted value. 


Fixed effects: Binary variables indicating the entity 
or time period in a panel data regression. 


Fixed effects regression model: A panel data regres- 
sion that includes entity fixed effects. 

$F_{m,n}$ distribution: The distribution of a ratio of independent
random variables, where the numerator
is a chi-squared random variable with m degrees 
of freedom, divided by m, and the denominator is 
an independently distributed chi-squared random 
variable with n degrees of freedom, divided by n. 


$F_{m,\infty}$ distribution: The distribution of a random
variable with a chi-squared distribution with m 
degrees of freedom, divided by m. 

Forecast error: The difference between the value of 
the variable that actually occurs and its forecasted 
value. 


Forecast interval: An interval that contains the future 
value of a time series variable with a prespecified 
probability. 

FPE: See final prediction error. 


F-statistic: A statistic used to test a joint hypoth- 
esis concerning more than one of the regression 
coefficients. 


Functional form misspecification: When the form of 
the estimated regression function does not match 
the form of the population regression function; 
for example, when a linear specification is used 
but the true population regression function is 
quadratic. 

GARCH: See generalized autoregressive conditional 
heteroskedasticity (GARCH). 

Gauss-Markov theorem: Under certain conditions,
the ordinary least squares estimator is the best 
linear unbiased estimator of the regression coeffi- 
cients conditional on the values of the regressors. 

Generalized autoregressive conditional heteroskedas- 
ticity (GARCH): A time series model for condi- 
tional heteroskedasticity. 

Generalized least squares (GLS): A generalization of 
ordinary least squares that is appropriate when the 
regression errors have a known form of heteroske- 
dasticity (in which case GLS is also referred to as 
weighted least squares, or WLS) or a known form 
of serial correlation. 

Generalized method of moments (GMM): A 
method for estimating parameters by fitting 
sample moments to population moments that 
are functions of the unknown parameters. 
Instrumental variables estimators are an important 
special case. 


GLS: See generalized least squares (GLS). 

GMM: See generalized method of moments (GMM). 

Granger causality test: A procedure for testing 
whether current and lagged values of one time 
series help predict future values of another time 
series. 


HAC standard errors: See heteroskedasticity- and 
autocorrelation-consistent (HAC) standard errors. 

Hawthorne effect: The phenomenon that 
experimental subjects change their behavior 
because they know they are subjects in an 
experiment. 


Heteroskedasticity: The variance of the regression 
error term $u_i$, conditional on the regressors, is not
constant. 


Heteroskedasticity- and autocorrelation-consistent 
(HAC) standard errors: Standard errors for ordi- 
nary least squares estimators that are consistent 
whether or not the regression errors are hetero- 
skedastic and/or autocorrelated. 


Heteroskedasticity- and autocorrelation-robust 
(HAR) standard errors: Another term for HAC 
standard errors. 

Heteroskedasticity-robust standard error: A standard 
error for the ordinary least squares estimator that 
is appropriate whether the error term is homoske- 
dastic or heteroskedastic. 

Heteroskedasticity-robust t-statistic: A t-statistic 
constructed using a heteroskedasticity-robust stan- 
dard error. 


Homoskedasticity: The variance of the regression 
error term $u_i$, conditional on the regressors, is
constant. 


Homoskedasticity-only F-statistic: A form of the 
F-statistic that is valid only when the regression 
errors are homoskedastic. 

Homoskedasticity-only standard errors: Standard 
errors for the ordinary least squares estimator 
that are appropriate only when the error term is 
homoskedastic. 


Hypothesis test: A procedure for using sample evi- 
dence to help determine if a specific hypothesis 
about a population is true or false. 


$I(0)$, $I(1)$, and $I(2)$: See order of integration.

Identically distributed: When two or more random 
variables have the same distribution. 

Idiosyncratic component: In a dynamic factor 
model, the part of a time series variable that is not 
explained by the common unobserved factors. 

i.i.d.: See independently and identically distributed
(i.i.d.).

Impact effect: The contemporaneous, or immediate, 
effect of a unit change in the time series variable 
$X_t$ on $Y_t$.

Imperfect multicollinearity: The condition in which 
two or more regressors are highly correlated. 


Included endogenous variables: Regressors that are 
correlated with the error term (usually in the con- 
text of instrumental variable regression). 

Included exogenous variables: Regressors that are 
uncorrelated with the error term (usually in the 
context of instrumental variable regression). 


Independence: When knowing the value of one ran- 
dom variable provides no information about the 
value of another random variable. Two random 
variables are independent if their joint distribution 
is the product of their marginal distributions. 

Independently and identically distributed (i.i.d.): When 
two or more independent random variables have 
the same distribution. 


Indicator variable: See binary variable. 


Information criterion: A statistic used to estimate 
the number of lagged variables to include in an 
autoregression or a distributed lag model. Leading 
examples are the Akaike information criterion 
(AIC) and the Bayes information criterion (BIC). 
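
As a hedged sketch, a commonly used form for an autoregression with $p$ lags estimated on $T$ observations is $\mathrm{BIC}(p) = \ln[SSR(p)/T] + (p+1)\ln(T)/T$ and $\mathrm{AIC}(p) = \ln[SSR(p)/T] + (p+1)(2/T)$; the lag length is chosen to minimize the criterion. The SSR values and function names below are made up for illustration.

```python
from math import log


def bic(ssr: float, T: int, p: int) -> float:
    """Bayes information criterion for a model with p lags plus an intercept."""
    return log(ssr / T) + (p + 1) * log(T) / T


def aic(ssr: float, T: int, p: int) -> float:
    """Akaike information criterion; its penalty per coefficient is 2/T."""
    return log(ssr / T) + (p + 1) * 2.0 / T


def choose_lag_length(ssr_by_p, T, criterion):
    """Return the lag length whose criterion value is smallest."""
    return min(ssr_by_p, key=lambda p: criterion(ssr_by_p[p], T, p))


# Hypothetical sums of squared residuals for p = 1, ..., 4 with T = 100.
ssr_by_p = {1: 120.0, 2: 110.0, 3: 107.0, 4: 106.5}
p_bic = choose_lag_length(ssr_by_p, 100, bic)
p_aic = choose_lag_length(ssr_by_p, 100, aic)
```

Because $\ln(T) > 2$ once $T$ exceeds about 7, the BIC penalizes extra lags more heavily, so in this example it selects a shorter lag length than the AIC.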

In-sample prediction: The predicted value of the 
dependent variable for an observation in the 
sample used to estimate the prediction model. 


Instrument: See instrumental variable. 


Instrument exogeneity condition: The requirement 
that an instrumental variable is uncorrelated with 
the error term in the instrumental variables regres- 
sion equation. 


Instrument relevance condition: The requirement 
that an instrumental variable is correlated with the 
included endogenous regressor. 

Instrumental variable: A variable that is correlated 
with an endogenous regressor (instrument rel- 
evance) and is uncorrelated with the regression 
error (instrument exogeneity). 


Instrumental variables (IV) regression: A way to 
obtain a consistent estimator of the unknown coef- 
ficients of the function relating Y to X when the 
regressor, X, is correlated with the error term, u. 


Interaction term: A regressor that is formed as the 
product of two other regressors, such as $X_{1i} \times X_{2i}$.


Intercept: The value of $\beta_0$ in the linear regression
model. 


Internal validity: When inferences about causal 
effects in a statistical study are valid for the popu- 
lation being studied. 

IV: See instrumental variables (IV) regression. 

Joint hypothesis: A hypothesis consisting of two or 
more individual hypotheses — that is, involving 
more than one restriction on the parameters of a 
model. 

Joint probability distribution: The probability distri- 
bution determining the probabilities of outcomes 
involving two or more random variables. 

J-statistic: A statistic for testing overidentifying 
restrictions in instrumental variables regression. 

Kurtosis: A measure of how much mass is contained
in the tails of a probability distribution. 

Lag: The value of a time series variable in a previous 
time period. The $j$th lag of $Y_t$ is $Y_{t-j}$.

Lasso (least absolute shrinkage and selection 
operator): The regression estimator that mini- 
mizes a penalized sum of squared residuals, where 
the penalty term is proportional to the sum of the 
absolute values of the regression coefficients. 
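
The objective that the Lasso minimizes can be written as $\sum_i (Y_i - b_1 X_{1i} - \cdots - b_k X_{ki})^2 + \lambda \sum_j |b_j|$. The sketch below only evaluates this objective for given coefficients (it does not perform the minimization), omits an intercept for brevity, and uses hypothetical names and data.

```python
def lasso_objective(b, X, y, lam):
    """Sum of squared residuals plus lam times the sum of |b_j| (no intercept)."""
    ssr = sum(
        (yi - sum(bj * xij for bj, xij in zip(b, row))) ** 2
        for row, yi in zip(X, y)
    )
    penalty = lam * sum(abs(bj) for bj in b)
    return ssr + penalty


X = [[1.0, 0.0], [0.0, 1.0]]   # two observations, two regressors
y = [1.0, 2.0]

perfect_fit = lasso_objective([1.0, 2.0], X, y, lam=0.5)  # zero SSR, positive penalty
all_zero = lasso_objective([0.0, 0.0], X, y, lam=0.5)     # zero penalty, positive SSR
```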

Law of iterated expectations: A result in probability 
theory that says that the expected value of Y is the 
expected value of its conditional expectation given 
X—that is, that E(Y) = E[E(Y|X)]. 
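
The identity can be checked numerically on a small discrete example. The joint distribution below is made up, and the helper names are hypothetical.

```python
# Hypothetical joint probabilities, keyed by (x, y).
joint = {(0, 1): 0.2, (0, 3): 0.3, (1, 1): 0.1, (1, 3): 0.4}


def pr_x(x):
    """Marginal probability that X = x."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def expected_y_given_x(x):
    """Conditional expectation E(Y | X = x)."""
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / pr_x(x)


# Left side: E(Y) computed directly from the joint distribution.
expected_y = sum(y * p for (_, y), p in joint.items())

# Right side: E[E(Y | X)], averaging the conditional means over the distribution of X.
iterated = sum(expected_y_given_x(x) * pr_x(x) for x in (0, 1))
```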

Law of large numbers: According to this result from 
probability theory, under general conditions the 
sample average will be close to the population 
mean with very high probability when the sample 
size is large. 

Least squares assumptions: The assumptions for the 
linear regression models listed in Key Concept 4.3 
(single variable regression model) and Key 
Concept 6.4 (multiple regression model). 

Least squares estimator: An estimator formed by 
minimizing the sum of squared residuals. 


Leptokurtic: A distribution that has heavier tails than 
a normal, as measured by a kurtosis exceeding 3. 

Likelihood function: The joint probability distri- 
bution of the data, treated as a function of the 
unknown coefficients. 


Limited dependent variable: A dependent variable 
that can take on only a limited set of values. For 
example, the variable might be a 0-1 binary vari- 
able or arise from one of the models described in 
Appendix 11.3. 

Linear-log model: A nonlinear regression function 
in which the dependent variable is Y and the 
independent variable is $\ln(X)$.


Linear probability model: A regression model in 
which Y is a binary variable. 


Linear regression function: A regression function 
with a constant slope. 


Local average treatment effect: A weighted average 
treatment effect estimated, for example, by two 
stage least squares. 


Logarithm: See natural logarithm. 


Logit regression: A nonlinear regression model for a 
binary dependent variable in which the population 
regression function is modeled using the cumula- 
tive logistic distribution function. 


Log-linear model: A nonlinear regression function 
in which the dependent variable is $\ln(Y)$ and the
independent variable is X. 

Log-log model: A nonlinear regression function in 
which the dependent variable is $\ln(Y)$ and the
independent variable is $\ln(X)$.


Longitudinal data: See panel data. 

Long-run cumulative dynamic multiplier: The cumu-
lative long-run effect on the time series variable Y 
of a change in X. 

Marginal probability distribution: Another name for 
the probability distribution of a random variable 
Y, which distinguishes the distribution of Y alone 
(the marginal distribution) from the joint distribu- 
tion of Y and another random variable. 

Maximum likelihood estimator (MLE): An estimator 
of unknown parameters that is obtained by maxi- 
mizing the likelihood function; see Appendix 11.2. 


Mean: The expected value of a random variable. The 
mean of $Y$ is denoted $\mu_Y$.

Mean squared forecast error (MSFE): The expected 
value of the square of the time series forecast 
error for an observation not in the data set used 
for estimating the forecasting model. 

Mean squared prediction error (MSPE): The 
expected value of the square of the prediction 
error for an observation not in the data set used 
for estimating the prediction model. 

m-fold cross validation: A method for estimating the 
mean squared prediction error by first dividing the 
in-sample data into m subsamples and then sequen- 
tially forming predictions for the observations in 
each subsample using the data not in that subsample. 
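
The procedure can be sketched as below. This is a simplified illustration, not a full implementation: observations are assigned to subsamples by position (in practice they would usually be shuffled first), and the "model" is a placeholder that predicts with the training-sample mean. All names are hypothetical.

```python
def m_fold_mspe(y, m, fit, predict):
    """Estimate the mean squared prediction error by m-fold cross validation.

    fit(training_values) returns a fitted model; predict(model) returns its
    prediction for a held-out observation.
    """
    n = len(y)
    # Assign observation i to subsample i % m (a simple, arbitrary split).
    folds = [list(range(start, n, m)) for start in range(m)]
    total_squared_error = 0.0
    for held_out in folds:
        held = set(held_out)
        training = [y[i] for i in range(n) if i not in held]
        model = fit(training)  # estimated without the held-out subsample
        total_squared_error += sum((y[i] - predict(model)) ** 2 for i in held_out)
    return total_squared_error / n


def sample_mean(values):
    return sum(values) / len(values)


# Placeholder model: predict every held-out value by the training-sample mean.
estimate = m_fold_mspe([1.0, 2.0, 3.0, 4.0], m=2, fit=sample_mean,
                       predict=lambda model: model)
```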

MLE: See maximum likelihood estimator (MLE). 


Moments of a distribution: The expected value of a 
random variable raised to different powers. The $r$th
moment of the random variable $Y$ is $E(Y^r)$.

MSFE: See mean squared forecast error (MSFE).

MSPE: See mean squared prediction error (MSPE).

Multicollinearity: See perfect multicollinearity and
imperfect multicollinearity.

Multiple regression model: An extension of the 
single variable regression model that allows Y to 
depend on k regressors. 

Multi-step ahead forecast: A forecast made for more
than one period beyond the final observation used
to make the forecast.


Natural experiment: See quasi-experiment. 


Natural logarithm: A mathematical function 
defined for a positive argument; its slope is always 
positive but tends to zero. The natural logarithm 
is the inverse of the exponential function; that is, 
$X = \ln(e^X)$.

95% confidence set: A confidence set with a 95% 
confidence level. See confidence interval. 


Nonlinear least squares: The analog of ordinary 
least squares that applies when the regression 
function is a nonlinear function of the unknown 
parameters. 


Nonlinear least squares estimator: The estimator 
obtained by minimizing the sum of squared residu- 
als when the regression function is nonlinear in the 
parameters. 


Nonlinear regression function: A regression function 
with a slope that is not constant. 

Nonstationary: When the joint distribution of one or 
more time series variables and their lagged values 
changes over time. 

Normal distribution: A commonly used bell-shaped 
distribution of a continuous random variable. 

Nowcast: The forecast of the value of a time series 
variable for the current period—that is, the period 
in which the forecast is made. 


Null hypothesis: The hypothesis being tested in a 
hypothesis test, often denoted $H_0$.

Observational data: Data based on observing, or 
measuring, actual behavior outside an experimen- 
tal setting. 

Observation number: The unique identifier assigned 
to each entity in a data set. 

OLS estimator: See ordinary least squares (OLS) 
estimator. 


OLS regression line: The regression line with popu- 
lation coefficients replaced by the ordinary least 
squares estimators. 

OLS residual: The difference between $Y_i$ and the
ordinary least squares regression line, denoted $\hat{u}_i$
in this text.

Omitted variables bias: The bias in an estimator that 
arises because a variable that is a determinant of Y 
and is correlated with a regressor has been omitted 
from the regression. 

One-sided alternative hypothesis: The parameter of 
interest is on one side of the value given by the 
null hypothesis. 


One-step ahead forecast: A forecast made for the 
period immediately following the final observation 
used to make the forecast. 


Oracle prediction: The infeasible best-possible pre- 
diction, which is made using the unknown condi- 
tional mean of the variable to be predicted given 
the predictors. 


Order of integration: The number of times that a 
time series variable must be differenced to make it 
stationary. A time series variable that is integrated 
of order d must be differenced d times and is 
denoted $I(d)$.


Ordinary least squares (OLS) estimators: The esti- 
mators of the regression intercept and slope(s) 
that minimize the sum of squared residuals. 
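
For the single-regressor case, the minimizers have the closed forms $\hat{\beta}_1 = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) / \sum_i (X_i - \bar{X})^2$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$. A minimal sketch with a hypothetical function name and made-up data:

```python
def ols_single_regressor(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (
        sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        / sum((xi - x_bar) ** 2 for xi in x)
    )
    intercept = y_bar - slope * x_bar
    return intercept, slope


x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]          # roughly y = 1 + 2x plus noise
beta0_hat, beta1_hat = ols_single_regressor(x, y)

# Residuals from the fitted line; with an intercept they sum to zero.
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
```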


Out-of-sample prediction: The predicted value of 
the dependent variable for an observation not 
in the sample used to estimate the prediction 
model. 


Outlier: An exceptionally large or small value of a 
random variable. 

Overidentification: When the number of instru- 
mental variables exceeds the number of included 
endogenous regressors. 


Panel data: Data collected for multiple entities 
where each entity is observed in two or more time 
periods. 

Parameters: Constants that determine a character- 
istic of a probability distribution or population 
regression function. 

Partial compliance: The failure of some participants 
to follow the treatment protocol in a randomized 
experiment. 

Partial effect: The effect on Y of changing one of 
the regressors while holding the other regressors 
constant. 

p.d.f.: See probability density function (p.d.f.). 

Penalized sum of squared residuals: The sum of the 
sum of squared residuals and a penalty term that 
increases with the number and/or values of the 
regression coefficients. 

Penalty term: A term that, when added to the sum 
of squared residuals, penalizes the estimator for 
choosing a large number of regressors and/or coef- 
ficients with large values. 

Perfect multicollinearity: A situation in which one 
of the regressors is an exact linear function of the 
other regressors. 

Polynomial regression model: A nonlinear regression 
function that includes $X, X^2, \ldots, X^r$ as regressors,
where $r$ is an integer.

Population: The group of entities (such as people,
companies, or school districts) being studied.

Population coefficients: See population intercept and 
slope. 

Population intercept and slope: The true, or population,
values of $\beta_0$ (the intercept) and $\beta_1$ (the
slope) in a single-variable regression. In a multiple
regression, there are multiple slope coefficients
($\beta_1, \beta_2, \ldots, \beta_k$), one for each regressor.

Population multiple regression model: The multiple 
regression model in Key Concept 6.2. 


Population regression line: In a single-variable 
regression, the population regression line is 
$\beta_0 + \beta_1 X_i$, and in a multiple regression, it is
$\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}$.

Potential outcomes: The set of outcomes that might 
occur to an individual (treatment unit) after receiv- 
ing, or not receiving, an experimental treatment. 


Power of a test: The probability that a test correctly 
rejects the null hypothesis when the alternative is 
true. 


Predicted value: The value of $Y_i$ that is predicted by
the ordinary least squares regression line, denoted
$\hat{Y}_i$ in this text.


Price elasticity of demand: The percentage change 
in the quantity demanded resulting from a 1% 
increase in price. 


Principal components: The linear combinations of a 
set of standardized variables for which the $j$th
linear combination maximizes its variance,
subject to being uncorrelated with the previous
$j - 1$ linear combinations.


Probability: The proportion of time that an outcome 
(or event) from a random experiment occurs in 
the long run. 

Probability density function (p.d.f.): For a continu- 
ous random variable, the area under the probabil- 
ity density function between any two points is the 
probability that the random variable falls between 
those two points. 

Probability distribution: For a discrete random vari- 
able, a list of all values that a random variable can 
take on and the probability associated with each of 
these values. 


Probit regression: A nonlinear regression model 
for a binary dependent variable in which the 
population regression function is modeled using 
the cumulative standard normal distribution 
function. 

Program evaluation: The field of study concerned 
with estimating the effect of a program, policy, or 
some other intervention or “treatment.” 


Pseudo out-of-sample forecast: A forecast com- 
puted over part of the sample using a procedure 
that is as if these sample data have not yet been 
realized. 

p-value (significance probability): The probability of 
drawing a statistic at least as adverse to the null 
hypothesis as the one actually computed, assum- 
ing the null hypothesis is correct. Also called the 
marginal significance probability, the p-value is 
the smallest significance level at which the null 
hypothesis can be rejected. 
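
For a two-sided test based on a t-statistic that is approximately standard normal in large samples, the p-value is $2\Phi(-|t|)$, where $\Phi$ is the standard normal c.d.f. The sketch below relies on that large-sample approximation and uses a hypothetical function name.

```python
from statistics import NormalDist


def two_sided_p_value(t_stat: float) -> float:
    """Large-sample two-sided p-value: probability that |Z| exceeds |t_stat|."""
    return 2.0 * NormalDist().cdf(-abs(t_stat))


p_at_critical = two_sided_p_value(1.96)   # close to 0.05
p_at_zero = two_sided_p_value(0.0)        # no evidence against the null
```

Rejecting the null whenever the p-value falls below the significance level is equivalent to comparing $|t|$ with the corresponding critical value.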

Quadratic regression model: A nonlinear regression 
function that includes $X$ and $X^2$ as regressors.


Quandt likelihood ratio statistic: A statistic used with 
time series data to test for a break in the regres- 
sion model at an unknown date. 

Quasi-experiment: A circumstance in which random- 
ness is introduced by variations in individual cir- 
cumstances that make it appear as if the treatment 
is randomly assigned. 

$R^2$: In a regression, the fraction of the sample variance
of the dependent variable that is explained
by the regressors. 
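As an illustration, the sketch below fits a one-regressor OLS line and reports the regression R² as one minus the ratio of the sum of squared residuals to the total sum of squares. The single-regressor setup and the function names are assumptions made to keep the example short.

```python
def ols_fit(x, y):
    """OLS intercept and slope for a regression of y on a single x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def r_squared(x, y):
    """Fraction of the sample variance of y explained by the regression."""
    intercept, slope = ols_fit(x, y)
    y_bar = sum(y) / len(y)
    ssr = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ssr / tss
```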

R̄²: See adjusted R². 

Randomized controlled experiment: An experiment 
in which participants are randomly assigned to a 
control group, which receives no treatment, or to a 
treatment group, which receives a treatment. 


Random walk: A time series process in which the 
value of the variable equals its value in the previ- 
ous period plus an unpredictable error term. 

Random walk with drift: A generalization of the ran- 
dom walk in which the change in the variable has a 
nonzero mean but is otherwise unpredictable. 
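A random walk, with or without drift, can be built up by accumulating shocks. In this sketch the shocks are passed in explicitly so the recursion is easy to follow; the function name and the Gaussian shocks in the usage lines are illustrative assumptions.

```python
import random

def random_walk(shocks, drift=0.0, start=0.0):
    """Path of Y_t = drift + Y_{t-1} + u_t for the given shocks u_t."""
    path = []
    level = start
    for u in shocks:
        level = level + drift + u
        path.append(level)
    return path

# Usage: simulate 100 periods with standard normal shocks.
rng = random.Random(0)
simulated = random_walk([rng.gauss(0.0, 1.0) for _ in range(100)], drift=0.2)
```

Setting drift to 0 gives the plain random walk; a nonzero drift makes the average change in the series nonzero.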
Realized volatility: The sample root mean square of 
a time series variable computed over consecutive 
time periods. 
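Read literally, the definition amounts to the short computation sketched here. The function name is hypothetical, and the input is assumed to be the series of values (for example, returns) over the consecutive periods of interest.

```python
from math import sqrt

def realized_volatility(values):
    """Sample root mean square of the values over consecutive periods."""
    return sqrt(sum(v * v for v in values) / len(values))
```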

Regressand: See dependent variable. 

Regression discontinuity: A regression involving a 
quasi-experiment in which treatment depends 
on whether an observable variable crosses a 
threshold. 


Regression specification: A description of a regres- 
sion that includes the set of regressors and any 
nonlinear transformation that has been applied. 

Regressor: A variable appearing on the right-hand 
side of a regression; an independent variable in a 
regression. 

Rejection region: The set of values of a test statistic 
for which the test rejects the null hypothesis. 

Repeated cross-sectional data: A collection of cross- 
sectional data sets, where each cross-sectional data 
set corresponds to a different time period. 

Residual: The difference between the observed value 
of the dependent variable and its value predicted 
by an estimated regression, for an observation 
in the sample used to estimate the regression 
coefficients, denoted û_i in the text. 


Restricted regression: A regression in which the coef- 
ficients are restricted to satisfy some condition. For 
example, when computing the homoskedasticity- 
only F-statistic, it is the regression with coefficients 
restricted to satisfy the null hypothesis. 


Ridge regression: The regression estimator that mini- 
mizes a penalized sum of squared residuals, where 
the penalty term is proportional to the sum of the 
squared regression coefficients. 
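For intuition, consider the simplest case: one regressor, no intercept, and data already demeaned. Minimizing the sum of squared residuals plus a penalty of lam times the squared coefficient then has a closed-form solution. This is a sketch under those assumptions; the name lam for the penalty weight is hypothetical.

```python
def ridge_slope(x, y, lam):
    """Ridge coefficient for a single demeaned regressor and no intercept.

    Minimizes sum((y_i - b * x_i)**2) + lam * b**2 over b.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)
```

With lam equal to 0 this is the OLS coefficient; larger values of lam shrink the estimate toward 0.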


RMSFE: See root mean squared forecast error 
(RMSFE). 

Root mean squared forecast error (RMSFE): The 
square root of the mean squared forecast error. 

Sample correlation coefficient (sample correlation): 
An estimator of the correlation between two ran- 
dom variables. 

Sample covariance: An estimator of the covariance 
between two random variables. 
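The sample covariance and sample correlation can be sketched as follows, using the conventional n - 1 divisor. The function names are hypothetical.

```python
from math import sqrt

def sample_covariance(x, y):
    """Sample covariance of x and y with divisor n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Sample covariance divided by the product of sample standard deviations."""
    s_x = sqrt(sample_covariance(x, x))
    s_y = sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)
```

The sample covariance of a variable with itself is its sample variance, which is how the standard deviations are obtained here.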

Sample selection bias: The bias in an estimator of a 
regression coefficient that arises when a selection 
process influences the availability of data and that 
process is related to the dependent variable. This 
bias induces correlation between one or more 
regressors and the regression error. 

Sample standard deviation: An estimator of the pop- 
ulation standard deviation of a random variable. 

Sample variance: An estimator of the population 
variance of a random variable. 


Sampling distribution: The distribution of a statistic 
over all possible samples; the distribution arising 
from repeatedly evaluating the statistic using a 
series of randomly drawn samples from the same 
population. 


Scatterplot: A plot of n observations on X_i and Y_i in 
which each observation is represented by the point 
(X_i, Y_i). 

Scree plot: The normalized variance of the ordered 
principal components of a set of variables X, 
plotted against the principal component number, 
where the variance is normalized by the sum of the 
variances of the X’s. 


SER: See standard error of the regression (SER). 

Serial correlation: See autocorrelation. 


Serially uncorrelated: A time series variable with all 
autocorrelations equal to 0. 

Shrinkage estimator: An estimator that introduces 
bias by shrinking the OLS estimator toward a 
specific point (usually 0) and thereby reducing the 
variance of the estimator. 

Significance level: The prespecified rejection prob- 
ability of a statistical hypothesis test when the null 
hypothesis is true. 

Significance probability: See p-value (significance 
probability). 

Simple random sampling: When entities are chosen 
independently from a population using a method 
that ensures that each entity is equally likely to be 
chosen. 

Simultaneous causality: When, in addition to the 
causal link of interest from X to Y, there is a 
causal link from Y to X. Simultaneous causality 
makes X correlated with the error term in the 
function of interest that relates Y to X. 


Simultaneous equations bias: See simultaneous 
causality. 

Size of a test: The probability that a test incorrectly 
rejects the null hypothesis when the null hypoth- 
esis is true. 


Skewness: A measure of the asymmetry of a prob- 
ability distribution. 

Sparse model: A regression model in which the coef- 
ficients are nonzero for only a small fraction of the 
predictors. 

SSR: See sum of squared residuals (SSR). 

Standard deviation: The square root of the variance. 
The standard deviation of the random variable Y, 
denoted σ_Y, has the same units as Y and is a mea- 
sure of the spread of the distribution of Y around 
its mean. 


Standard error of an estimator: An estimator of the 
standard deviation of the estimator. 

Standard error of the regression (SER): An 
estimator of the standard deviation of the 
regression error u. 

Standardized predictive regression model: A spe- 
cial case of the linear multiple regression model 
in which the regressors are standardized and the 
dependent variable is demeaned so that it has 
mean 0. 


Standardized random variable: Subtracting the 
mean and dividing by the standard deviation 
produces a standardized random variable with 
a mean of 0 and a standard deviation of 1. The 
standardized random variable computed from Y 
is (Y - μ_Y)/σ_Y. 

Standard normal distribution: The normal distribu- 
tion with mean equal to 0 and variance equal to 1, 
denoted N(0, 1). 

Stationarity: When the joint distribution of a time 
series variable and its lagged values does not 
change over time. 

Statistically insignificant: The null hypothesis 
(typically, that a regression coefficient is 0) cannot 
be rejected at a given significance level. 

Statistically significant: The null hypothesis (typically, 
that a regression coefficient is 0) is rejected at a 
given significance level. 

Stochastic trend: A persistent but random long-term 
movement of a variable over time. 

Strict exogeneity: The requirement that the regres- 
sion error have a mean of 0 conditional on current, 
future, and past values of the regressor in a distrib- 
uted lag model. 

Student t distribution: The Student t distribution 
with m degrees of freedom is the distribution of 
the ratio of a standard normal random variable 
to the square root of an independently 
distributed chi-squared random variable with m 
degrees of freedom divided by m. As m gets large, 
the Student t distribution converges to the stan- 
dard normal distribution. 
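Written as a formula, the definition reads as follows. The symbols Z and W are labels chosen here for the standard normal and chi-squared random variables.

```latex
t = \frac{Z}{\sqrt{W/m}}, \qquad
Z \sim N(0, 1), \quad W \sim \chi^2_m, \quad Z \text{ and } W \text{ independent}.
```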

Sum of squared residuals (SSR): The sum of the 
squared ordinary least squares residuals. 


t distribution: See Student t distribution. 


Time effects: Binary variables indicating the time 
period in a panel data regression. 


Time fixed effects: See time effects. 


Time series data: Data collected for the same entity 
for multiple time periods. 


Total sum of squares (TSS): The sum of squared 
deviations of Y_i from its average. 


t-ratio: See t-statistic. 


Treatment effect: The causal effect in an experiment 
or a quasi-experiment. See causal effect. 


Treatment group: The group that receives the treat- 
ment or intervention in an experiment. 


TSLS: See two stage least squares. 
TSS: See total sum of squares (TSS). 
t-statistic: A statistic used for hypothesis testing. 
See Key Concept 5.1. 

Two-sided alternative hypothesis: When, under the 
alternative hypothesis, the parameter of inter- 
est is not equal to the value given by the null 
hypothesis. 

Two stage least squares (TSLS): An instrumental 
variable estimator, described in Key Concept 12.2. 


Type I error: In hypothesis testing, the error made 
when the null hypothesis is true but is rejected. 

Type II error: In hypothesis testing, the error 
made when the null hypothesis is false but is not 
rejected. 


Unbalanced panel: A panel data set in which data for 
some entities are missing for some time periods. 

Unbiased estimator: An estimator with a bias that is 
equal to 0. 

Uncorrelated: Two random variables are uncorre- 
lated if their correlation is 0. 


Underidentification: When the number of instru- 
mental variables is less than the number of endog- 
enous regressors. 

Unit root: An autoregression with a largest root 
equal to 1. 


Unrestricted regression: A regression in which the 
coefficients are not restricted to satisfy some con- 
dition. When computing the homoskedasticity-only 
F-statistic, it is the regression that applies under 
the alternative hypothesis, so that the coefficients 
are not restricted to satisfy the null hypothesis. 

VAR: See vector autoregression. 


Variance: The expected value of the squared differ- 
ence between a random variable and its mean; the 
variance of Y is denoted σ_Y². 

Vector autoregression (VAR): A model of k time 
series variables consisting of k equations, one for 
each variable, in which the regressors in all equa- 
tions are lagged values of all the variables. 
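To make the structure concrete, the sketch below computes a one-step-ahead forecast from a VAR with a single lag. The one-lag restriction, the function name, and the coefficient values in the usage lines are all illustrative assumptions; each row of the coefficient matrix holds one equation's coefficients on the lagged values of all k variables.

```python
def var1_forecast(intercepts, coefficients, last_values):
    """One-step-ahead forecast from a VAR(1): one equation per variable."""
    return [
        c + sum(a * v for a, v in zip(row, last_values))
        for c, row in zip(intercepts, coefficients)
    ]

# Hypothetical two-variable example (k = 2).
forecast = var1_forecast(
    intercepts=[1.0, 0.0],
    coefficients=[[0.5, 0.1], [0.2, 0.3]],
    last_values=[2.0, 10.0],
)
```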


Volatility clustering: When a time series variable 
exhibits some clustered periods of high variance 
and other clustered periods of low variance. 


Weak instruments: Instrumental variables that have a 
low correlation with the endogenous regressor(s). 


Weighted least squares (WLS): An alternative to 
ordinary least squares that can be used when 
the regression error is heteroskedastic and the 
form of the heteroskedasticity is known or can 
be estimated. 
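A minimal sketch of WLS with one regressor and an intercept follows. It assumes the error variance of each observation is known, which corresponds to the case where the form of the heteroskedasticity is known; each observation is weighted by the inverse of its error variance. The function name is hypothetical.

```python
def wls_fit(x, y, error_variances):
    """WLS intercept and slope, weighting by the inverse error variance."""
    w = [1.0 / v for v in error_variances]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope
```

When all error variances are equal, the weights cancel and the result coincides with OLS.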


WLS: See weighted least squares (WLS). 

and fuzzy), 495-496 

Discrete choice data analysis, 414 

Discrete random variables 

defined, 56 

probability distribution of, 56-58, 
57t 

Distributed lag model 

with AR(1) errors, 625-627 

assumptions of, 617-618 

autocorrelated u, standard errors and 
inference, 618 

defined, 614 

exogeneity and, 615-616 

OLS estimation of ADL model, 
627-628 

Distributions. See also Statistics; specific 

distribution names 
asymptotic distribution, 85 
Bernoulli distribution, 58 
bivariate normal distribution, 
77,79 
central limit theorem, 86-90, 87f, 
88f, 90f 
chi-squared distribution, 80 
conditional distributions, 66-67, 
67t 
conditional expectation (mean), 
67-68, 70 
conditional variance, 69 
exact distribution, 85 
F distribution, 80-81 
finite-sample distribution, 85 
joint probability distribution, 65-66, 
66t, 67t 
kurtosis, 63f, 64 
large-sample approximations, 85-90, 
87f, 88f, 90f 
marginal probability distribution, 
66, 66t 
moments of, 63-65, 63f 
multivariate normal distribution, 
77,79 
normal distributions, 75-79, 75f, 
76f 
of OLS estimators, 162-164, 163f 
sampling distribution, 83-84 
skewness, 63-64, 63f 
standard normal distributions, 75-79, 
75f, 76f 
Student t distribution, 80 
Dollar/pound exchange rates, 560-561, 
560f 
DOLS (dynamic OLS) estimator, 
665-667 
Double-blind experiments, 480 
Drift, random walk with, 584 
Dummy variables, 186-188. See also 
Binary variables 
Dummy variable trap, 229-231 
Dynamic causal effects. See also Causal 
effects 
ADL model notation, 647-648 
autocorrelated u, standard errors and 
inference, 618 
cumulative dynamic multipliers, 
618-619 
distributed lag model, 614 
distributed lag model, assumptions, 
617-618 
distributed lag model with AR(1) 
errors, 625-627 
distribution of OLS estimator with 
autocorrelated errors, 
620-621 
estimation with strictly exogenous 
regressors, 624-629 
exogeneity, types of, 615-616 
feasible GLS estimator, 629 
generalized least squares (GLS) 
estimator, 628-629 
HAC standard error, 621-624 
infeasible GLS estimator, 628-629 
OLS estimation of ADL model, 
627-628 
overview of, 609-610, 639 
Dynamic factor model (DFM), 671-676, 
682 
application to U.S. macroeconomic 
data, 676-680, 677t, 678t, 679f, 
680t 
Dynamic multipliers, 618-619 
Dynamic OLS (DOLS) estimator, 
665-667 


E 
Earnings 
age and, 129-130, 130f 
education level and, 192, 193, 193f 
gender gap, 119-120, 292-296, 294f, 
298-306, 301f, 305t 
socioeconomic class gap, 189-190, 192 
Econometrics, definitions and uses, 
43 
Economics journals, demand for, 
307-309, 307f, 308t 
EG-ADF (Engle-Granger Augmented 
Dickey-Fuller) test, 665, 665t, 
666-667 
Education level, earnings distributions 
and, 192, 193, 193f 
Efficiency, 105-108 
Efficient GMM estimator, 739 
Eicker-Huber-White standard errors, 
191. See also Heteroskedasticity- 
robust standard errors 
Eigenvalues, 751 
Eigenvectors, 751 
Elasticity, 289 
cigarette taxes, effect of, 435-437 
demand for economics journals, 
307-309, 307f, 308t 
instrumental variables (IV) regression, 
430-432 
nonlinear regression functions, 
328-329 
Election results, sampling bias and, 
108 
Endogenous variables 
defined, 428, 615 
TSLS in general IV regression model, 
439-441 
weak instruments and, 445 
Engle, Robert, 669-670, 680-681 
Engle-Granger Augmented Dickey- 
Fuller (EG-ADF) test, 665, 665t, 
666-667 
Entity and time fixed effects regression 
model, 371-374 
Equilibrium effects, 481 
Error correction term, 663 
Errors-in-variable bias, 336-339 
Error term, linear regression, 145-146. 
See also Standard error of 
regression (SER) 
Estimate, defined, 104 
Estimation of population mean, 104-108 
differences-of-means, 121-123 
Estimators. See also Instrumental 
variables (IV) regression; 
specific estimator names 
asymptotic distribution theory and, 
690-692 
BLUE (Best Linear Unbiased 
Estimator), 106-107 
Cochrane-Orcutt estimator, 629 
consistent estimator, 690-691 
defined, 104 
differences estimator, 476-477 
differences-in-differences estimator, 
492-494, 493f 
DOLS (dynamic OLS) estimator, 
665-667 
efficient GMM estimator, 739 
feasible GLS estimator, 629 
fixed effects estimator, 388-390 
Frisch-Waugh Theorem, 243-244 
generalized least squares (GLS) 
estimator, 628-629 
HAC (heteroskedasticity-and 
autocorrelation-consistent) 
estimator, 621-624 
heterogeneous populations, estimates 
in, 498-502 
homoskedasticity-only standard 
error, 191, 243 
infeasible GLS estimator, 
628-629 
instrumental variable estimators, 
494 
Lasso, 527-532, 529f, 531f 
least absolute deviations (LAD) 
estimator, 196 
least squares estimator, 107, 
141-142 
linear conditionally unbiased 
estimators, 726-727 
multiple regression, OLS estimator 
in, 219-222 
Newey-West variance estimator, 
623 
nonlinear least squares estimators, 
327 
ordinary least squares (See Ordinary 
least squares (OLS) estimator) 
regression discontinuity estimators, 
495-496 
ridge regression, 524-527, 525f, 
527f 
sample covariance and correlation, 
127-130, 128f, 130f 
shrinkage estimator, 521-522 
standard error of the regression 
(SER), 154 
two stage least squares (TSLS) 
estimator, 429 
weighted least squares (WLS) 
estimator, 195-196, 699-704 
Exact distribution, 85 
Exactly identified coefficients, defined, 
438 
Exogeneity 
defined, 615-616 
plausibility of, 637-639 
Exogeneity of instrument, 446-449 
test of overidentifying restrictions, 
448-449 
Exogenous variables, 428 
general IV regression model, 438-439 
included exogenous variables, 437 
instrument relevance and, 440-441 
Expectation, defined, 60. See also Mean 
Expected value, 60-61 
of Bernoulli random variable, 61 
of continuous random variable, 61 
Experimental data, 49. See also Data 
Experiments. See also Quasi-experiments 
attrition of subjects, 479 
average causal (treatment) effect, 
475-476 
comparison of observational and 
experimental estimates, 488-490 
double-blind experiments, 480 
Hawthorne effect, 480 
heterogeneous populations, estimates 
in, 498-502 
overview of, 474-475, 503 
potential outcomes, causal effects and 
idealized experiments, 475-477 
randomized controlled experiment, 
defined, 47 
sample size, validity and, 481 
test for random receipt of treatment, 
478 
treatment protocol, adherence to, 479 
validity, threats to, 478-481 
Explained sum of squares (ESS), 
153-154 
Exponential function, 289. See also 
Logarithms 
External validity, 331, 332-333 
predictions and, 344-345 
threats to, 481, 498 


F 
False positive rate, 115 
Fama, Eugene, 681 
Fan chart, 577, 577f, 578 
F distribution, 80-81, 711 
critical values for, A47t-A50t 
Feasible GLS estimator, 629, 648 
Feasible WLS estimator, 701 
Final prediction error (FPE), 574, 759 
Finite kurtosis, 159-160 
Finite-sample distribution, 85 

First differences, 555-558, 556f, 558t 
First-order autoregression, 565-567 
First-stage F-statistic, 446 
First-stage regression(s), 440 
Fixed effects 
assumptions, 374-376 
asymptotic distribution, fixed effects 
estimator, 388-390 
time fixed effects, 371-374 
Florida orange crop, temperature effect 
on 
data set, 610-612, 611f, 646 
example analysis, price and cold 
weather, 630-636, 631t, 632f, 
634f, 635f 
Forecast, defined, 48 
Forecast error, 562-563 
Forecasting. See also Prediction 
fan chart, 577, 577f, 578 
final prediction error (FPE), 574 
forecast types and forecast errors, 
562-563 
forecast uncertainty and forecast 
intervals, 576-578 
least squares assumption, multiple 
predictors, 571-573 
mean squared forecast error (MSFE), 
563-565 
MSFE estimation and forecast inter- 
vals, 573-578 
multi-period forecasts, 654-658 
nowcasting, 676 
oracle forecast, 565 
overview of, 554-555, 596, 649, 682 
pseudo out-of-sample forecasts, 
574-576 
root mean squared forecast error 
(RMSFE), 563-565 
Forecast interval, defined, 576 
FPE (final prediction error), 574, 759 
Fraction correctly predicted, 406-407 
Frisch-Waugh Theorem, 243-244 
F-statistic 
defined, 253 
heteroskedasticity-robust F-statistic, 
254-255 
homoskedasticity-only F-statistic, 
254-255 
multiple regression, theory of, 
721-722, 725-726 
OLS distribution derivation, 754-755 
overall regression F-statistic, 255 
weak instruments and, 446 
Functional form misspecification, 336 
Fuzzy regression discontinuity design, 
495-496 


G 
GARCH (generalized ARCH), 669-671 
Gauss-Markov conditions, 208 


Gauss-Markov conditions for multiple 
regression, 726-727 
Gauss-Markov theorem, 191, 194-196, 
726-727 
proof of, 207-210, 755-756 
GDP. See Gross Domestic Product 
(GDP) 

Gender gap in earnings, 119-120, 192 
logarithm models for, 292-296, 294f 
nonlinear regression, variable interac- 
tions, 298-306, 301f, 305t 
General equilibrium effects, 481 
Generalized ARCH (GARCH), 
669-671 
Generalized least squares (GLS) 
estimator, 628-629 
assumptions of, 729-730 
conditional mean zero assumption, 
730-733 
feasible GLS estimator, 629, 730 
infeasible GLS estimator, 628-629, 
730 
multiple regression, theory of, 
728-733 
Generalized method of moments 
(GMM), 681 
efficiency, proof of, 758 
efficient GMM estimator, 739 
GMM J-statistic, 740 
time series data and, 740-741 

Granger, Clive, 663, 680-681 

Gross Domestic Product (GDP) 
autoregression, 566-567, 568 
break detection, pseudo out-of- 
sample forecasts, 594-595, 595f 
defined, 46, 555 
multi-period forecasts, 654-658 
nonstationarity, trends, 582-589, 587t 
vector autoregression (VAR) model- 
ing, 653 

Growth rates 
time series data, 555-558, 556f, 558t 


H 
HAC. See Heteroskedasticity-and 
autocorrelation-consistent 
(HAC) estimator 
HAC standard error, 621-624 
Hansen, Lars Peter, 681 
Hawthorne effect, 480 
Heckman, James, 414 
Heterogeneous populations, estimates 
in, 498-502 
Heteroskedasticity, 188-192, 189f 
ARCH (autoregressive conditional 
heteroskedasticity), 669-671 
GARCH (generalized ARCH), 
669-671 
linear probability model, 395 
multiple regression model, 219 
OLS estimator distribution with auto- 
correlated errors, 620-621 
robust standard error formula, 206 
weighted least squares (WLS) estima- 
tor, 195-196, 700-704 
Heteroskedasticity-and autocorrelation- 
consistent (HAC) estimator, 
621-624 
direct multi-period regression, 
657-658 
HAC standard error, 621-624 
Heteroskedasticity-and-autocorrelation- 
robust (HAR) standard errors, 
376 
Heteroskedasticity-robust F-statistic, 
254-255 
validity and, 343-344 
Heteroskedasticity-robust J-statistic, 740 
Heteroskedasticity-robust standard 
errors, 191-192 
asymptotic distributions and, 695-696 
linear probability model, 395 
multiple regression, theory of, 
719-720 
use in linear regression with single 
regressor, 703-704 
Heteroskedasticity-robust t-statistic, 
696-697 
Heteroskedasticity-robust variance 
estimators, 720 
Homoskedasticity, 188-193, 189f, 193f 
multiple regression model, 219, 243 
Homoskedasticity-only F-statistic, 
255-258 
Homoskedasticity-only standard error, 
191-192 
formulas for, 206-207 
multiple regression, theory of, 
724-725 
Homoskedasticity-only t-statistic, 
698-699 
Homoskedastic normal regression 
assumptions, 196-197 
Household earning, 189-190 
Hypothesis tests, 109-117 
acceptance region, 115 
alternative hypothesis, 109 
comparing means from different 
populations, 119-120 
confidence intervals and population 
mean, 117-118 
critical value, 115 
false positive rate, 115 
linear regression with single 
regressor, 178-184 
multiple regression 
joint hypotheses tests, 251-258 
single coefficient, 247-251 
single restriction, multiple 
coefficient tests, 258-259 
nonlinear regression, 287-288 
null hypothesis, 109 
one-sided alternative hypothesis, 
116-117 
population mean, tests about, 179 
power of the test, 115 
prespecified significance level, 114-116 
p-value, 109-111, 111f 
rejection region, 115 
significance level, 115-116 
size of the test, 115 
Student t distribution, 125-127 
two-sided alternative hypothesis, 109 
type I and II errors, 115 


I 
Idempotent matrix, 751 
Identically distributed, 82-84 
Impact effect, 619 
Imperfect multicollinearity, 230-231 
Included exogenous variables, 437 
Income, distribution in U.K., 72-73, 
72f, 73t 
social class, education, and, 122-123, 
122t 
Independently and identically 
distributed (i.i.d.), 82-84 
Independent variable, 145-146 
Indicator variables, 186-188. See also 
Binary variables 
Infeasible GLS estimator, 628-629 
Infeasible WLS estimator, 700-701 
In-sample prediction, 155-156 
Instrumental variable estimators, 494 
in matrix form, 733-734 
Instrumental variables, defined, 427 
Instrumental variables estimation of 
treatment effect, 479 
Instrumental variables (IV) regression 
assumptions and sampling 
distribution, 441-442 
endogenous and exogenous variables, 
428 
general IV regression model, 437-444 
general IV regression model, 
relevance of, 440-441 
general IV regression model, validity 
and, 441 
heterogeneous populations, estimates 
in, 500-502 
included exogenous and control 
variables, 438-439 
inference using TSLS estimator, 
442-443 
instrument exogeneity, 446-449 
instrument validity, 454-459 
IV model and assumptions, 428-429 
overview, 427, 459 
terminology, 437-438 
test of overidentifying restrictions, 
448-449 
TSLS (two stage least squares) 
estimator, 429, 434-435 
with control variables, 471-473 
derivation of formula, 466 
large-sample distributions, 467-469 
weak instruments, 445-446, 469-471 
Wright, Philip and Sewell, 430-432, 
447 
Instrument exogeneity condition, 429 
Instrument relevance condition, 429 
Instruments 
defined, 427 
validity in quasi-experiments, 497-498 
Integrated of order d, I(d), 659-662, 661f 
Integrated of order one, I(1), 659-662, 
661f 
Integrated of order zero, I(0), 659-662, 
661f 
Interacted regressor, 298-300 
Interaction regression model, 298-300 
Interaction term, 298-300 
Intercept 
linear regression, 145-146 
population regression line, 217-218 
Interest rates 
cointegration and, 663-667 
term spread, 46 
Internal validity, 330-332 
errors-in-variable bias, 336-339 
functional form misspecification, 336 
inconsistency of OLS standard error, 
343-344 
measurement errors, 336-339 
missing data and sample selection, 
339-340 
predictions and, 344-345 
simultaneous causality, 341-343 
threats to, overview, 331-334, 478-481 
threats to, quasi-experiments, 496-498 
Iterated multi-period AR forecasts, 
654-656 
Iterated multi-period VAR forecasts, 
655-656 
IV. See Instrumental variables (IV) 
regression 


J 


Joint hypothesis 
Bonferroni test of, 274-276 
defined, 252 
multiple regression, theory of, 
721-722 
tests of, 251-258 
Jointly stationary, 562 
Joint probability distribution, 65-66, 
66t, 67t 
independent variables, 70 
likelihood function, 405-406 


J-statistic, 449 
asymptotic distribution, proof of, 
756-758 
GMM J-statistic, 740, 758 
heteroskedasticity-robust J-statistic, 740 
homoskedasticity and, 737-738 
null hypothesis and, 453 


K 
Kurtosis, 63f, 64 


L 


Lagged value, 556 
Lag operator, 606 
Lag polynomial, 606, 647-648 
Lags, 555-558, 556f, 558t. See also 
Autoregressive distributed lag 
(ADL) model 
autoregressive-moving average 
(ARMA) model, 607 
distributed lag model, 614 
lag length estimation, 578-582, 580t 
lag length selection, 581-582 
vector autoregression lag lengths, 652 
Lasso (least absolute shrinkage and 
selection operator), 527-532, 
529f, 531f 
LATE (Local average treatment effect), 
500-502 
Law of iterated expectations, 68-69 
Law of large numbers, 85-86 
asymptotic distribution theory and, 
691-692 
Least absolute deviations (LAD) esti- 
mator, 196 
Least squares assumption, 157-161, 159f, 
164 
for causal inference, 176-177 
causal inference with control 
variables, 233-234, 245-246 
first least squares assumption for 
prediction, 519 
forecasting with multiple predictors, 
571-573 
multiple regression, causal inference, 
225-227 
multiple regression, predictions with, 
244-245 
Least squares estimator, 107. See also 
Ordinary least squares (OLS) 
estimator 
causal inference assumption, 156-161, 
159f 
two stage least squares (TSLS) esti- 
mator, 429 
Leptokurtic, 63f, 64 
Likelihood function, 405-406 
Limited dependent variable, 393. See 
also Binary dependent variables, 
regression with 


Linear conditionally unbiased 
estimators, 726-727 
Linear deterministic time trends, 
587-589, 587t 
Linear functions 
random variables, mean and variance, 
62 
Linear-log model, 290-291, 292f 
Linear probability model, 393-397, 394f, 
403 
Linear regression 
binary variables and, 186-188 
causal inference and prediction, 
143-144 
coefficients, estimating of, 147-152, 
147t, 148f, 151f 
confidence intervals for regression 
coefficients, 184-186 
constant regressor, 218-219 
constant term, 218-219 
homoskedastic normal regression 
assumptions, 196-197 
least absolute deviations (LAD) 
estimator, 196 
least squares assumptions for causal 
inference, 156-161, 159f, 164, 
176-177 
measures of fit, 153-156 
model for, 144-147, 146f 
multiple regression 
measures of fit, 222-225 
model for, 217-219 
OLS estimator in, 219-222 
omitted variable bias, 211-216, 
242 
ordinary least squares (OLS) 
estimator, 148-152, 151f 
algebraic facts, 175 
derivation of, 172-173 
sampling distribution of, 161-164, 
163f, 173-175 
with small sample size, 196-197 
terminology of, 145-146 
Linear regression, single regressor, 
145-146 
asymptotic distribution, OLS estimator 
and t-statistic, 695-697 
exact sampling distribution, normal 
error distributions, 697-699 
extended least squares assumptions, 
688-689 
hypothesis testing, 178-184 
overview of, 687 
weighted least squares, 699-704 
Local average treatment effect (LATE), 
500-502 
Logarithms, 288-296, 290f, 292f, 
294f 
computing predicted values of Y, 
295-296 


elasticity of demand, 307-309, 307f, 
308t 
linear-log model, 290-291, 292f 
log-linear model, 291-292 
log-log model, 293-294, 294f 
natural logarithm, defined, 289 
percentages and, 289-290 
slopes and elasticities, 328-329 
time series data, 555-558, 556f, 558t 
Logistical regression. See Logit 
regression 
Logistic curve, 325-326, 326f 
Logit regression, 397 
maximum likelihood estimator 
(MLE), 405-406, 423 
measures of fit, 406-407 
multinomial logit models, 426 
nonlinear least squares estimation, 
404-405 
overview, 401-403, 402f 
Log-linear model, 291-292 
Log-log model, 293-294, 294f 
Longitudinal data, 51-52, 52t 
Long-run cumulative dynamic 
multiplier, 619 


M 
Machine learning, 516 
Many-predictor problem, 516-523, 517t 
Marginal probability distribution, 
66, 66t 
Martingale, 583-584 
Massachusetts education data, 346-353, 
346t, 347f, 349t, 351t, 360, 
488-490 
Matrix notation 
addition and multiplication, 750 
covariance matrix, 752 
eigenvalues and eigenvectors, 751 
idempotent matrix, 751 
matrix algebra, summary of, 748-751 
matrix definitions and types, 749 
matrix inverse, 750 
positive definite and semidefinite, 751 
rank, 751 
square root, 751 
trace, 751 
Maximum likelihood estimator (MLE), 
405-406 
for logit model, 423 
for n i.i.d. Bernoulli random variables, 
421-422 
for probit model, 422-423 
pseudo-R², 423 
McFadden, Daniel, 414 
Mean. See also Expected value 
Bernoulli random variable, 62 
conditional expectation (mean), 
67-68, 70 
defined, 60 
law of iterated expectations, 68-69 
linear functions of random variables, 
62 
sample average (mean), 82-84 
sums of random variables, 71, 74 
Mean squared forecast error (MSFE) 
estimation of, forecast intervals and, 
573-578 
forecast uncertainty, 576-578 
overview of, 563-565 
Mean squared prediction error (MSPE), 
518 
estimation of, m-fold cross validation, 
522-523 
linear regression estimated by OLS, 
758-759 
Mean vector, defined, 752 
Measurement errors, 336-339 
Measures of fit 
binary dependent variables, 
regression with, 406-407 
fraction correctly predicted, 
406-407 
in multiple regression, 222-225 
pseudo-R², 406-407 
regression R², 153-154 
m-fold cross validation, 522-523 
MLE. See Maximum likelihood 
estimator (MLE) 
Moments of a distribution, 63-65, 
63f 
Mortgage lending 
probit regression, 397-401, 398f 
racial discrimination, questions 
about, 44-45, 407-413, 408t, 410t, 
411t 
Mosteller, Frederick, 543 
MSFE. See Mean squared forecast error 
(MSFE) 
MSPE. See Mean squared prediction 
error (MSPE) 
Multicollinearity, 226, 228-231, 716 
Multinomial logit model, 426 
Multinomial probit model, 426 
Multi-period forecasts, 654-658 
Multiple regression. See also Binary 
dependent variables, regression 
with; Multiple regression, 
theory of; Nonlinear regression 
functions 
adjusted R², 223-225 
confidence sets for multiple 
coefficients, 259-260, 260f 
control variables and conditional 
mean, 231-234, 245-246 
dummy variable trap, 229-230 
Frisch-Waugh Theorem, 243-244 
HAC standard error, 623-624 
interactions between variables, 
306 
joint hypotheses, tests of, 251-258, 
274-276 
least squares assumption, causal 
inference and, 225-227 
least squares assumption, predictions 
with, 244-245 
model of, 217-219 
model specification guidelines, 
260-262 
OLS estimator, 219-222 
OLS estimator, distribution of, 
227-228 
perfect multicollinearity, 226-227, 
228-231 
R² and adjusted R² interpretation, 
262, 263 
regression R² defined, 223 
single coefficient, hypothesis tests, 
247-251 
single restriction, multiple coefficient 
tests, 258-259 
standard error of regression (SER), 
222-223 
Multiple regression, theory of 
asymptotic distribution of t-statistic, 
720 
asymptotic normality of OLS 
estimator, 718-719 
confidence intervals, predicted values, 
720 
confidence sets for multiple coef- 
ficients, 722 
extended least squares assumptions, 
715-716 
Gauss-Markov conditions for 
multiple regression, 726-727 
Gauss-Markov theorem, proof of, 
755-756 
generalized least squares, 728-733 
heteroskedasticity-robust standard 
errors, 719-720 
joint hypothesis tests, 721-722 
matrix notation of multiple 
regression model, 714-715 
multivariate central limit theorem, 
718 
OLS estimator, 716-717 
overview, 713-714 
regression statistic distributions, 
normal errors, 722-726 
TSLS (two stage least squares) 
estimator 
asymptotic distribution, 734-735 
homoskedastic errors, 735-738 
matrix form, 734 
Multiple regression model with control 
variables, 233-234, 245-246 
Multi-step ahead forecasts, 562-563 
Multivariate central limit theorem, 718 


Multivariate distributions, 752-753 
Multivariate normal distribution, 77, 
79, 752 


N 
National Statistics Socio-economic 
Classification (NS-SEC), 72 
Natural experiments, 490. See also 
Quasi-experiments 
Natural logarithm, 289. See also 
Logarithms 
Negative exponential growth, 326, 328f 
Newey, Whitney, 623 
Newey-West variance estimator, 623
Nonlinear least squares, 327 
estimation and inference, logit and 
probit models, 404-405 
Nonlinear least squares estimators, 327 
Nonlinear regression functions 
changes in X and Y, 282-283 
cubic regression model, 287-288 
general functions with nonlinear 
parameters, 326-327 
interactions between variables, 297 
continuous and binary variable, 
300-303, 301f 
two binary variables, 298-300 
two continuous variables, 305-309 
interpreting coefficients in, 285 
logarithms, 288-296, 290f, 292f, 294f 
logistic curve, 325-326, 326f 
logit (logistical) regression, 397, 
401-403, 402f 
modeling strategies, 279-286, 279f, 
281f, 285-286 
negative exponential growth, 326 
nonlinear least squares estimation, 
327 
overview, 277-278, 278f 
polynomial regression model, 286-288 
probit regression, 397-401, 398f 
quadratic regression model, 279f, 
280-281, 281f 
slopes and elasticities, 328-329 
standard errors of estimated effects, 
284-285 
Nonrandom regressors, 158-159 
Nonrepresentative samples, 481 
Nonsingular matrix, 750 
Nonstationarity 
breaks, 589-596, 591t, 593f, 595f
trends, 582-589, 587t 
unit root tests, nonnormal 
distributions, 661-662 
Nonstationary, defined, 562 
Normal distributions, 75-79, 75f, 76f 
multivariate normal distribution, 77, 
79 
Normal probability density function 
(p.d.f.), 710


Normal random variables 
linear combination and quadratic 
forms, 752-753 
Nowcasting, 676 
Null hypothesis, 109 
comparing means from different 
populations, 119-120 
false positive rate, 115 
hypothesis testing about slope, 
180-181 
joint null hypotheses, 252-258 
J-statistic and, 453 
prespecified significance level, 
114-116 


O
Observational data, 49 
Observation number, defined, 50 
OLS. See Ordinary least squares (OLS) 
estimator 
OLS regression line, 220-222 
OLS residual, 220-222 
Omitted variable bias, 211-216, 242, 262, 
334-336 
One-sided alternative hypothesis, 116-117 
One-step ahead forecasts, 562-563 
Oracle forecast, 565 
Oracle predictor, 518 
Orange juice 
example analysis, price and cold 
weather, 630-636, 631t, 632f,
634f, 635f 
Florida orange crop data set, 610-612, 
611f, 646 
Ordered response regression models, 425 
Orders of integration, 658-662, 661f 
Ordinary least squares (OLS) estimator, 
148-152, 151f. See also 
Instrumental variables (IV) 
regression 
adjusted R², 223-225
algebraic facts about, 175 
asymptotic distributions and, 695, 753 
autocorrelated u, standard errors and 
inference, 618 
derivation of, 172-173 
derivation of, k = 1, 551
distributions of test statistics, 
derivations of, 754-755 
DOLS (dynamic OLS) estimator, 
665-667 
Frisch-Waugh Theorem, 243-244 
Gauss-Markov theorem for multiple 
regression, 726-727 
heterogeneous populations, estimates 
in, 498-502 
homoskedasticity, 190-191, 243 
hypothesis tests about mean and 
slope, 181-182 
Lasso, 528-532, 529f, 531f 


linear probability model, 395 
many-predictor problem and, 
516-523, 517t 
MSPE for linear regression and, 
758-759 
multiple regression, 219-222 
least squares assumptions, 225-227, 
715-716 
multicollinearity, 228-231 
OLS distribution, 227-228 
standard errors, 247-248 
theory of, 716-719, 723-724 
OLS regression line, 220-222 
OLS residual, 220-222 
predictions with, 155-156 
regression R², defined, 223
ridge regression, 524-527, 525f, 527f
sampling distribution, 161-164, 163f, 
173-175 
shrinkage estimator and, 521-522 
single regressors, extended least 
squares assumptions, 688-689 
standard error of regression, 211-216 
stochastic trends, problems caused by, 
585-586 
theoretical foundation, 194-196, 
207-210 
time series data, autocorrelated 
errors, 620-621 
validity, inconsistency of OLS 
standard error, 343-344 
in vector autoregression (VAR), 
650-651 
weighted least squares (WLS) 
estimator, 699-704 
Ordinary least squares (OLS) regression 
line, 149-152, 151f 
Outcomes, defined, 56 
Outliers 
kurtosis and, 63f, 64 
least squares assumptions and, 
159-160 
Out-of-sample prediction, 155-156 
computation of, 552-553 
pseudo out-of-sample forecasts, 
574-576 
Overidentified coefficients, 438 
test of overidentifying restrictions, 
448-449 


P 
Panel data 
before and after comparisons, 
365-367, 366f 
asymptotic distribution, fixed effects 
estimator, 388-390 
balanced panel, 362 
defined, 51-52, 52t, 362 
fixed effects regression assumptions, 
374-376 


regression with fixed time effects, 
371-374 
standard errors for fixed effect 
regression, 376 
unbalanced panel, 362 
Parameters, linear regression, 145-146 
Partial compliance, 479 
Partial effect, 218 
Pattern recognition, 516 
p.d.f. (probability density function), 58, 
59f 
Penalized sum of squared residuals, 
524-527, 525f, 527f 
Percentages, logarithms and, 289-290 
Perfect multicollinearity, 226-227, 
228-231 
Polynomial regression model, 286-288, 
296-297, 297f 
Pooled standard error formula, 125-127, 
197 
Population mean 
comparing means from different 
populations, 119-120 
confidence intervals for, 117-118 
hypothesis testing, 109-117, 179 
Population multiple regression model, 
218-219 
Population regression line (function), 
145-146, 217-218 
Populations. See also Sampling 
attrition of subjects, 479 
heterogeneous populations, estimates 
in, 498-502 
simple random sampling, 81-82 
Positive definite matrix, 751 
Positive semidefinite matrix, 751 
Potential outcomes 
causal effects and, 475-477 
defined, 475 
Power, hypothesis testing, 115 
Predicted value, 149, 150, 220-222 
Prediction. See also Dynamic causal 
effects; Forecasting 
defined, 48 
first least squares assumption for 
prediction, 519 
internal and external validity, 344-345 
Lasso, 527-532, 529f, 531f 
many-predictor problem and OLS, 
516-523, 517t 
mean squared prediction error 
(MSPE), 518 
oracle predictor, 518 
with ordinary least squares (OLS) 
estimator, 155-156 
overview of, 514-515, 542-544 
principal components, 532-537, 533f, 
536f, 537f 
ridge regression, 524-527, 525f, 527f 
shrinkage estimator, 521-522 
sparse model, 528
standardized predictive regression 
model, 519-521 
Price, inflation rate and, 660-661, 661f 
Price elasticity of demand, 45 
Principal components, 532-537, 533f, 
536f, 537f 
formulas for, 761-762 
scree plot, 534-535, 536f 
Probability density function (p.d.f.), 58, 
59f, 710 
Probability distributions. See also 
Statistics 
asymptotic distribution, 85 
Bayes’ rule, 69 
Bernoulli distribution, 58 
bivariate normal distribution, 
77,79 
chi-squared distribution, 80 
conditional distributions, 66-67, 67t
of continuous random variable, 
58, 59f 
cumulative probability distribution, 
57, 57ft
defined, 56 
of discrete random variable, 56-58, 
57ft 
F distribution, 80-81 
finite-sample distribution, 85 
independent variables, 70 
joint probability distribution, 65-66,
66t, 67t 
kurtosis, 63f, 64 
large-sample approximations, 85-90, 
87f, 88f, 90f 
marginal probability distribution, 
66, 66t 
moments of a distribution, 63-65, 
63f 
multivariate normal distribution, 77, 
79 
normal distributions, 75-79, 75f, 76f 
skewness, 63-64, 63f
standard deviation and variance, 
61-62 
Student t distribution, 80
Probit regression, 397-401, 398f
maximum likelihood estimator 
(MLE), 405-406, 422-423 
measures of fit, 406-407
multinomial probit models, 426 
nonlinear least squares estimation, 
404-405 
ordered probit model, 425 
Program evaluation, 474. See also 
Experiments; Quasi-experiments 
Project STAR, 482-490, 484t, 485t, 487t,
489t, 510
Pseudo out-of-sample forecasts, 
574-576 
Pseudo-R², 406-407, 423
pth-order autoregressive [AR(p)]
model, 567-568
p-value, 109-111, 111f 
F-statistic and, 254-255 
hypothesis testing about population 
mean, 179 
hypothesis testing about slope, 
180-181 
two-sided tests, 182, 182f 


Q 


Quadratic forms, normal random
variables, 752-753
Quadratic regression model, 279f, 
280-281, 281f 
Quandt likelihood ratio (QLR) statistic, 
590-593, 591t, 593f
Quasi-difference, 626 
Quasi-experiments. See also 
Experimental data 
defined, 490 
differences-in-differences estimator, 
492-494, 493f 
heterogeneous populations, estimates 
in, 498-502 
instrumental variable estimators, 
494 
overview of, 474-475, 503 


potential outcomes and causal effects,
475-477
regression discontinuity estimators, 
495-496 
repeated cross-sectional data, 494 
validity, external threats, 481, 498 
validity, internal threats, 478-481, 
496-498 


R 
Racial discrimination in mortgage 
lending, 44-45, 407-413, 408t,
410t, 411t
Randomization, validity and, 478, 
496-497 
Randomization based on covariates, 
477 
Randomized controlled experiment. 
See also Experiments; Quasi- 
experiments 
causal and treatment effects, 121-123 
conditional mean, 157-158 
overview of, 47-48
time series data and, 613 
Random sampling, 81-82. See also 
Sampling 
Random variables 
Bernoulli random variable, 58 
bivariate normal distribution, 77, 79 
chi-squared distribution, 80 
conditional distributions, 66-67, 67t 


conditional expectation (mean), 
67-68, 70 
conditional variance, 69 
covariance and correlation, 
70-71 
defined, 56 
expected value, 60-61 
F distribution, 80-81 
independent variables, 70 
joint probability distribution, 65-66,
66t 
kurtosis, 63f, 64 
law of iterated expectations, 68-69 
law of large numbers, 85-86 
marginal probability distribution, 
66, 66t 
mean and variance, linear functions, 
62 
mean and variance, sums of variables, 
71,74 
moments of distribution, 63-65, 
63f 
multivariate normal distribution, 77, 
79 
normal distributions, 75-79, 75f, 76f 
skewness, 63-64, 63f 
standard deviation and variance, 
61-62 
Student t distribution, 80
Random walk, 583-584, 659 
Rank of matrix, 751 
Realized volatility, 668-669, 669f 
Reduced form, 439 
Regression 
autoregression, 565-568 
binary dependent variables and 
linear probability model, 393-397, 
394f 
logit (logistical) regression, 397 
maximum likelihood estimator 
(MLE), 405-406 
measures of fit, 406-407 
nonlinear least squares estimation, 
404-405 
overview, 392-393, 413-414 
probit regression, 397-401, 398f 
censored regression models, 424 
count data, 425 
cubic regression model, 287-288 
discrete choice data, 426 
instrumental variables (See 
Instrumental variables (IV) 
regression) 
linear (See Linear regression; Linear 
regression, single regressor) 
multiple (See Multiple regression; 
Multiple regression, theory of) 
nonlinear regression (See Nonlinear 
regression functions) 
ordered response models, 425 


polynomial regression model, 
286-288 
quadratic regression model, 279f, 
280-281, 281f 
ridge regression, 524-527, 525f, 527f 
sample selection models, 424-425 
spurious regression, 584-586 
standardized predictive regression 
model, 519-521 
Tobit regression, 424 
truncated regression models, 
424-425 
vector autoregression (VAR), 
649-653 
Regression discontinuity, 495-496
Regression R?, 153-154 
defined, 223 
interpretation of, 262, 263 
Regressor, 145-146 
multicollinearity, 228-231 
Rejection region, 115 
Relevance of instrument 
general IV regression model, 440-441 
instrumental variables (IV) regression, 
444-446 
Repeated cross-sectional data, 494 
Residual, 149, 150 
Restricted regression, 256 
single restriction, multiple coefficient 
tests, 258-259 
Restrictions, 252 
Ridge regression estimator, 524-527, 
525f, 527f 
derivation of, 759-761 
precautions about, 530-531 
Risk, measures of, 152 
River of blood, inflation forecasts, 577, 
577f, 578 
RMSFE. See Root mean squared 
forecast error (RMSFE) 
Roll, Richard, 636 
Root mean squared forecast error 
(RMSFE), 563-565 
forecast uncertainty, 576-578 
Row vector, 748-749 
rth moment, 65


S 

Sample average (mean), 82-84
Sample correlation, 127-130, 128f, 130f
Sample correlation coefficient, 127-130,
128f, 130f
Sample covariance, 127-130, 128f, 130f
Sample regression function, 149-152,
151f
Sample regression line, 149-152, 151f
Sample selection bias, 340, 414
Sample selection regression models,
424-425
Sample space, 56
Sample standard deviation, 111-113 
Sample variance, 111-113 
consistency, 141-142 
Sampling distribution, 83-84 
Sargent, Thomas, 681 
Scalar, defined, 749 
Scatterplots, 127-130, 128f, 130f 
Schwarz information criterion (SIC),
579 
Scree plot, 534-535, 536f, 674-675 
Second difference, 659 
Second-stage regression(s), 440 
Serial correlation, 558-559 
Sharp regression discontinuity design, 
495-496 
Shea, Dennis, 124 
Shiller, Robert, 681 
Shrinkage estimator, 521-522 
Lasso, 528-532, 529f, 531f 
ridge regression, 524-527, 525f, 527f 
Significance level, hypothesis testing 
and, 114-116 
Significance probability, 109-111, 111f 
Sims, Christopher, 652, 681 
Simple random sampling, 81-82 
Simultaneous causality, 341-343 
Simultaneous equations bias, 342-343 
Size of test, hypothesis testing, 115 
Skewness, 63-64, 63f
Slope 
hypothesis testing about, 180-182 
linear regression, 145-146, 149-151, 
151f 
nonlinear regressions, 277, 278f, 
328-329 (See also Nonlinear 
regression functions) 
one-sided hypothesis tests, 
182-184 
ordinary least squares (OLS) 
estimators, 149-151, 151f 
population regression line, 217-218 
Slutsky’s theorem, 693-694
Socioeconomic class, household
earnings by, 189-190
Smoking. See Cigarette taxes 
Sparse model, 528-532, 529f, 531f 
Spurious regression, 584-586 
Square matrices, 749 
Square root of matrix, 751 
Standard deviation. See also Statistics 
defined, 61 
sampling distribution, estimators for, 
179 
Standard error 
clustered standard errors, 376 
direct multi-period regression, 
657-658 
dynamic causal effects and, 618 
fixed effects regression errors, 376 
HAC standard error, 621-624 


heteroskedasticity-and-autocorrelation-robust
(HAR) standard errors, 376
heteroskedasticity-robust standard
errors, 191-192, 206
homoskedasticity, 188-193, 189f, 193f,
243
homoskedasticity, error formulas,
206-207
homoskedasticity-only standard
error, 191-192
linear probability model, 395
multiple regression, 222-223, 224
nonlinear regression, estimated
effects, 284-285
for predicted probabilities, MLE and,
423
TSLS (two stage least squares)
estimator, 442-443, 735
validity, inconsistency of OLS
standard error, 343-344
Standard error of regression (SER), 154
and mean square forecast error
(MSFE), 573-574
Standard error of sample average,
111-113
pooled standard error formula, 125-127
consistency, 141-142
Standardization, 65 
Standardized predictive regression 
model, 519-521 
Standardized random variables, 65 
Standard normal distributions, 75-79, 
75f, 76f 
values for, A43t-A44t 
Stationarity, 561-562, 572 
in autoregressive model, 605-606
in autoregressive-moving average 
(ARMA) model, 607 
Stochastic trends, 583, 584 
cointegration, 663-667, 665t 
common trend, 663 
detection and avoidance of, 586-589, 
587t 
orders of integration and unit root 
tests, 658-662, 661f 
problems caused by, 584-586 
Stock market 
beating the market, 563-565 
capital asset pricing model (CAPM), 
152 
diversification and risk, 84 


forecasting with macroeconomic data,
676-680, 677t, 678t, 679, 680t
performance of funds and market, 341
probability distributions, market
swings, 77-79, 78f
realized volatility, 668-669, 669f
volatility clustering, 561, 667-668,
667f
Wilshire 5000 Total Market Index,
560f, 561, 667-669, 667f, 669f 
Strict exogeneity, 615-616 
Structural VAR modeling, 652, 681 
Student t distribution, 80, 125-127, 711
critical values for, A45t 
small sample size and, 197 
Student-teacher ratio and test scores 
California school testing data, 49-50, 
49t, 516-523, 517t 
Lasso prediction model, 531-532, 531f 
Massachusetts data, 346-353, 346t,
347f, 349t, 351t, 360
Project STAR, Tennessee, 482-490, 
484t, 485t, 487t, 489t, 510
Sum of squared residuals (SSR), 
153-154 
Sup-Wald statistic, 590-593, 591t, 593f
Survivorship bias, 341 
Symmetric matrices, 749 


T


Tariffs, instrumental variables (IV)
regression, 430-432 
Taxes. See Cigarette taxes 
t distribution, 80 
Tennessee, Project STAR, 482-490, 484t,
485t, 487t, 489t, 510
Term spread, 47 
GDP growth forecasts, 568-570, 569f, 
638 
vector autoregression (VAR) 
modeling, 653 
Test for random receipt of treatment, 
478 
Test for the difference between two 
means, 119-120 
Test of overidentifying restrictions, 
448-449 
Test power, hypothesis testing, 115 
Test size, hypothesis testing, 115 
Test statistic, 113-114 
Text data, 516, 543 
Thaler, Richard, 124 
Time fixed effects regression model, 
371-374 
Time series data, 159. See also Dynamic 
causal effects; Time series 
regression 
autocorrelation (serial correlation) 
and autocovariance, 558-559 
central limit theorem and, 693 
defined, 50-51, 51t 
generalized method of moments 
(GMM), 740-741 
law of large numbers and, 693 
OLS estimator distribution with 
autocorrelated errors, 620-621 
as randomized controlled 
experiments, 613 
Time series regression
Akaike information criterion (AIC), 
579-581, 608 
ARCH (autoregressive conditional 
heteroskedasticity), 669-671 
autoregressions, 565-568 
autoregressive distributed lag (ADL) 
model, 570-571 
autoregressive-moving average 
(ARMA) model, 607 
Bayes information criterion (BIC), 
579, 580t, 581, 607-608 
cointegration, 663-667, 665t 
dynamic factor model (DFM), 
671-676 
final prediction error (FPE), 574 
forecast uncertainty and forecast 
intervals, 576-578 
GARCH (generalized ARCH), 
669-671 
generalized method of moments 
(GMM) and, 740-741 
lag length estimation, 578-582, 580t 
lag length selection, 581-582 
lag operator notation, 606 
lags, first differences, logarithms, and 
growth rates, 555-558, 556f, 558t
least squares assumption, multiple 
predictors, 571-573 
mean squared forecast error (MSFE), 
563-565 
MSFE estimation and forecast 
intervals, 573-578 
multi-period forecasts, 654-658 
nonstationarity, breaks, 589-596, 591t, 
593f, 595f 
nonstationarity, trends, 582-589, 587t 
nowcasting, 676 
orders of integration and unit root 
tests, 658-662, 661f 
overview of, 554-555, 596, 649, 682 
pseudo out-of-sample forecasts, 
574-576 
root mean squared forecast error 
(RMSFE), 563-565 
spurious regression, 584-586 
stationarity, 561-562, 605-606 
stochastic trends, 583, 584 
detection and avoidance of, 
586-589, 587t 
problems caused by, 584-586 
unit root, 584 
vector autoregression (VAR), 
649-653 
Tobin, James, 424 
Tobit regression, 424 
Trace of matrix, 751 
Traffic deaths and alcohol taxes, 275f, 
362-365 
Transpose, matrices, 749 


t-ratio, 113-114 
Treatment effect, 121-123 
instrumental variables estimation 
of, 479 
local average treatment effect 
(LATE), 500-502 
Treatment group 
defined, 47-48 
repeated cross-sectional data, 494 
Treatment protocol, validity and, 479, 
497 
Trends, 582-589, 587t 
cointegration, 663-667, 665t 
common trend, 663 
deterministic trends, 582-583 
orders of integration and unit root 
tests, 658-662, 661f 
random walk, 583-584 
stochastic trends, 583, 584 
detection and avoidance of, 
586-589, 587t 
problems caused by, 584-586 
Truncated regression models, 424-425 
Truncation parameter, HAC, 622-623 
TSLS. See Two stage least squares 
(TSLS) estimator 
t-statistic, 113-114 
asymptotic distributions and, 
696-697, 720 
central limit theorem and, 694 
comparing means from different 
populations, 119-120 
confidence intervals and population 
mean, 118 
general form of, 179 
homoskedasticity-only t-statistic, 
698-699 
hypothesis testing about population 
mean, 179 
hypothesis testing about slope, 
180-181 
multiple regression, theory of, 725 
with small sample size, 123, 125-127, 
196-197 
stochastic trends, problems caused by, 
585-586 
Student t distribution, 125-127
Two-sided alternative hypothesis, 109 
hypothesis testing about slope, 
180-181 
Two stage least squares (TSLS) 
estimator, 429 
asymptotic distribution of, 734-735 
with control variables, 471-473 
derivation of formula, 466-467 
first- and second-stage regressions, 
440 
general IV regression model, 439-440 
homoskedastic errors, 735-738 
inference and, 442-443 


instrument exogeneity and, 446-449 
IV regression sampling distribution, 
441-442 
large-sample distribution, 467-469 
local average treatment effect 
(LATE), 500-502 
matrix form, 734 
standard errors for, 735 
weak instruments and, 445-446
Type I error, 115 
Type II error, 115 


U 
Unbalanced panel, 362 
Unbiased estimators, 104-108 
Unconfoundedness, 513 
Uncorrelated variables, 71 
Underidentified coefficients, 438 
Unemployment rates, 560, 560f, 702 
Unit root, 584 

cointegration, 664-665
orders of integration and nonnormality
of tests, 658-662, 661f
Unrestricted regression, 256 


V 
Validity 
external validity, 331, 332-333 
general IV regression model, 441 
Hawthorne effect, 480 
instrumental variables (IV)
regression, 444-449, 454-459
internal validity, 330-332 
internal validity, threats to, 331-334, 
478-481, 496-498 
errors-in-variable bias, 336-339 
functional form misspecification, 336 
inconsistency of OLS standard 
error, 343-344 
measurement errors, 336-339 
missing data and sample selection, 
339-340 
omitted variable bias, 334-336 
simultaneous causality, 341-343 
predictions and, 344-345 
VAR. See Vector autoregression (VAR) 
Variables. See also Statistics; specific 
variable names 
Bernoulli random variable, 58 
binary variables, 186-188 
constant regressor, 218-219 
constant term, 218-219 
continuous random variables, 56 
control variable, 231-232 
dependent variable, 145-146 
discrete random variables, 56 
dummy variables, 186-188 
endogenous variables, 428 
exogenous variables, 428 
included exogenous variables, 437 


independently distributed
(independent) variables, 70
independent variable, 145-146
indicator variables, 186-188
standardized random variables, 65
Variance
of Bernoulli random variable, 62
conditional variance, 69
defined, 61
of estimators, 104-108
homoskedasticity, 188-193, 189f, 193f,
243
linear functions of random variables,
62
sample average (mean), 82-84
sums of random variables, 71, 74
volatility clustering, 668
Vector autoregression (VAR), 682
causal analysis with, 652
inference in, 650-651
iterated multivariate forecasts,
655-656
lag length determination, 652
model of, 649-650 
structural VAR modeling, 652 
Vector error correction model (VECM), 
663 
Vectors. See also Matrix notation 
definitions and types, 748-749 
eigenvectors, 751 
multivariate distributions, 752-753
Volatility 
ARCH (autoregressive conditional 
heteroskedasticity), 680-681 
GARCH (generalized ARCH), stock 
market example, 670-671, 681 
realized volatility, 668-669, 669f 
volatility clustering, 561, 667-668


W
Wages. See also Earnings, distribution 
in U.S.
Wallace, David, 543 
Weak dependence, 572 
Weak instruments 
checking for, 446 
defined, 445
instrumental variable analysis, 
469-471 
problems with, 445 
Weighted least squares (WLS) estimator, 
195-196 
feasible WLS, 701 
infeasible WLS, 700 
linear regression, one regressor, 
699-704 
West, Kenneth, 623 
Wilshire 5000 Total Market Index, 560f, 
561, 667-669, 667f, 669f 
GARCH (generalized ARCH), 
670-671 
WLS. See Weighted least squares (WLS) 
estimator 
Wold decomposition theorem, 607 
Wright, Philip G., 430-432, 447 
Wright, Sewell, 430-431, 447 


Z 


Zero-period dynamic multiplier, 619 


Large-Sample Critical Values for the t-statistic from the Standard Normal Distribution

                                                  Significance Level
                                                  10%      5%       1%
2-Sided Test (≠): Reject if |t| is greater than   1.64     1.96     2.58
1-Sided Test (>): Reject if t is greater than     1.28     1.64     2.33
1-Sided Test (<): Reject if t is less than        -1.28    -1.64    -2.33
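
The large-sample critical values for the t-statistic are quantiles of the standard normal distribution, so they can be recomputed rather than looked up. The following is a minimal sketch, not part of the original text, that assumes Python's standard-library `statistics.NormalDist`; the function name `normal_critical_values` is hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def normal_critical_values(alpha):
    """Return (two_sided, upper_one_sided, lower_one_sided) critical values
    for a test with significance level alpha, using the standard normal
    large-sample approximation to the t-statistic."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    two_sided = z.inv_cdf(1 - alpha / 2)  # reject if |t| exceeds this
    upper = z.inv_cdf(1 - alpha)          # reject if t exceeds this
    lower = z.inv_cdf(alpha)              # reject if t is below this
    return two_sided, upper, lower


if __name__ == "__main__":
    for alpha in (0.10, 0.05, 0.01):
        two_sided, upper, lower = normal_critical_values(alpha)
        print(f"{alpha:.0%}: |t| > {two_sided:.2f}, "
              f"t > {upper:.2f}, t < {lower:.2f}")
```

Rounded to two decimals, the three significance levels give the same entries as the table for the two-sided and one-sided tests.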


Large-Sample Critical Values for the F-statistic from the F_{m,∞} Distribution

Reject if F > Critical Value

                          Significance Level
Degrees of Freedom (m)    10%     5%      1%
1                         2.71    3.84    6.63
2                         2.30    3.00    4.61
3                         2.08    2.60    3.78
4                         1.94    2.37    3.32
5                         1.85    2.21    3.02
6                         1.77    2.10    2.80
7                         1.72    2.01    2.64
8                         1.67    1.94    2.51
9                         1.63    1.88    2.41
10                        1.60    1.83    2.32
11                        1.57    1.79    2.25
12                        1.55    1.75    2.18
13                        1.52    1.72    2.13
14                        1.50    1.69    2.08
15                        1.49    1.67    2.04
16                        1.47    1.64    2.00
17                        1.46    1.62    1.97
18                        1.44    1.60    1.93
19                        1.43    1.59    1.90
20                        1.42    1.57    1.88
21                        1.41    1.56    1.85
22                        1.40    1.54    1.83
23                        1.39    1.53    1.81
24                        1.38    1.52    1.79
25                        1.38    1.51    1.77
26                        1.37    1.50    1.76
27                        1.36    1.49    1.74
28                        1.35    1.48    1.72
29                        1.35    1.47    1.71
30                        1.34    1.46    1.70
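
The first two rows of the F-statistic table can be checked by hand, because the large-sample F distribution with m restrictions is a chi-squared distribution with m degrees of freedom divided by m. The sketch below is an illustration under that assumption and uses only the Python standard library; the function names are hypothetical. For m = 1 the critical value is the square of the two-sided normal critical value, and for m = 2 the scaled chi-squared variable is exponential with mean 1, so the critical value is -ln(alpha). Rows with m > 2 need a chi-squared quantile function, which this sketch does not attempt.

```python
import math
from statistics import NormalDist


def f_inf_critical_m1(alpha):
    """Critical value of F(1, infinity): chi-squared(1) is the square of a
    standard normal, so square the two-sided normal critical value."""
    return NormalDist().inv_cdf(1 - alpha / 2) ** 2


def f_inf_critical_m2(alpha):
    """Critical value of F(2, infinity): chi-squared(2) / 2 is exponential
    with mean 1, whose upper-tail probability is exp(-x)."""
    return -math.log(alpha)


if __name__ == "__main__":
    for alpha in (0.10, 0.05, 0.01):
        print(f"{alpha:.0%}: m=1 -> {f_inf_critical_m1(alpha):.2f}, "
              f"m=2 -> {f_inf_critical_m2(alpha):.2f}")
```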